ArXiv Digest: Drug, Cosmetic & Veterinary Science (EN–VI) - 2025-11-11

MolChord: Structure-Sequence Alignment for Protein-Guided Drug Design.
EN: MolChord, a new approach for structure-based drug design (SBDD), aligns protein structures with molecular structures by integrating a structure encoder and NatureLM, a large language model unifying text, molecules, and proteins. It uses Direct Preference Optimization (DPO) on a curated, property-aware dataset to guide molecules towards desired pharmacological properties. Evaluated on CrossDocked2020, MolChord achieved state-of-the-art performance, demonstrating its potential for practical drug discovery applications.
VI: MolChord, một phương pháp mới cho thiết kế thuốc dựa trên cấu trúc (SBDD), căn chỉnh cấu trúc protein với cấu trúc phân tử bằng cách tích hợp bộ mã hóa cấu trúc và NatureLM, một mô hình ngôn ngữ lớn thống nhất văn bản, phân tử và protein. Nó sử dụng Direct Preference Optimization (DPO) trên một tập dữ liệu được tuyển chọn kỹ lưỡng, nhận biết thuộc tính, để hướng dẫn các phân tử đạt được các đặc tính dược lý mong muốn. Được đánh giá trên CrossDocked2020, MolChord đạt được hiệu suất tốt nhất, chứng minh tiềm năng ứng dụng thực tế trong khám phá thuốc.
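
As a rough illustration of the DPO step mentioned above, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is not MolChord's implementation: the function name, the `beta` value, and the example log-probabilities are assumptions made purely for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities under the policy being trained
    and under a frozen reference model. beta=0.1 is a hypothetical setting.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen sample over the rejected one, relative to the reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # -log(sigmoid(z)), written in a numerically stable form.
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Made-up numbers: "chosen" stands for a molecule with better properties.
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```

The loss equals log 2 when the policy and the reference agree, and falls below it as the policy shifts probability toward the preferred sample.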
BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs.
EN: This study benchmarks large language models (LLMs) for biomedical coreference resolution using the CRAFT corpus. It compares LLMs against SpanBERT using four prompting strategies that incorporate domain-specific information. While LLMs show promise, especially LLaMA 8B and 17B with entity-augmented prompts, they struggle with long-range dependencies and ambiguous mentions, suggesting that further refinement is needed for robust performance in this domain. The findings highlight the potential of prompt engineering to improve LLM performance in biomedical NLP.
VI: Nghiên cứu này đánh giá hiệu suất của các mô hình ngôn ngữ lớn (LLMs) trong việc giải quyết đồng tham chiếu (coreference resolution) trong lĩnh vực y sinh sử dụng bộ dữ liệu CRAFT. Nghiên cứu so sánh LLMs với SpanBERT thông qua bốn chiến lược nhắc (prompting) khác nhau, kết hợp thông tin đặc thù của lĩnh vực. Mặc dù LLMs cho thấy nhiều hứa hẹn, đặc biệt là LLaMA 8B và 17B khi sử dụng nhắc tăng cường thực thể, chúng vẫn gặp khó khăn với các phụ thuộc tầm xa (long-range dependencies) và các đề cập mơ hồ (ambiguous mentions), cho thấy cần phải tinh chỉnh thêm để đạt hiệu suất mạnh mẽ trong lĩnh vực này. Kết quả nhấn mạnh tiềm năng của kỹ thuật nhắc (prompt engineering) trong việc cải thiện hiệu suất của LLMs trong xử lý ngôn ngữ tự nhiên (NLP) y sinh.
Benchmarking a foundation potential against quantum chemistry methods for predicting molecular redox potentials.
EN: This study benchmarks the MACE-OMol-0 foundation potential (FP) against DFT for predicting molecular redox potentials in electron transfer (ET) and proton-coupled electron transfer (PCET) reactions. The FP demonstrates high accuracy for PCET, comparable to DFT. However, its performance decreases for ET, especially multi-electron transfers with underrepresented ions in the training data. The authors propose a hybrid workflow: FP for geometry optimization and thermochemistry, followed by DFT single-point energy refinement and solvation correction. This improves accuracy and scalability for high-throughput screening of redox-active molecules.
VI: Nghiên cứu này đối chiếu thế năng nền tảng (foundation potential, FP) MACE-OMol-0 với DFT để dự đoán điện thế oxy hóa khử của phân tử trong các phản ứng truyền điện tử (ET) và truyền điện tử ghép proton (PCET). FP cho thấy độ chính xác cao đối với PCET, tương đương với DFT. Tuy nhiên, hiệu suất giảm đối với ET, đặc biệt là truyền nhiều điện tử với các ion ít được đại diện trong dữ liệu huấn luyện. Các tác giả đề xuất một quy trình làm việc kết hợp: FP để tối ưu hóa hình học và nhiệt hóa học, sau đó là tinh chỉnh năng lượng một điểm DFT và hiệu chỉnh dung môi. Điều này cải thiện độ chính xác và khả năng mở rộng để sàng lọc thông lượng cao các phân tử hoạt động oxy hóa khử.
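
The hybrid workflow described above can be sketched as a simple bookkeeping step: combine a DFT single-point energy with FP-derived thermal corrections and a solvation term, then convert the free-energy change into a potential. This is a minimal sketch for a pure electron-transfer step only (a PCET step would also need a proton free-energy term). The class and field names, the example energies, and the reference value of 4.44 V for the absolute standard hydrogen electrode are all assumptions, not values taken from the paper.

```python
from dataclasses import dataclass

# Assumed absolute potential of the standard hydrogen electrode, in volts.
ABS_SHE_V = 4.44

@dataclass
class Species:
    e_sp_dft: float    # DFT single-point energy at the FP-optimized geometry (eV)
    g_therm_fp: float  # thermal free-energy correction from the FP (eV)
    dg_solv: float     # solvation free-energy correction (eV)

    def free_energy(self) -> float:
        # Hybrid free energy: each term comes from a different level of theory.
        return self.e_sp_dft + self.g_therm_fp + self.dg_solv

def reduction_potential(oxidized, reduced, n_electrons=1, ref_v=ABS_SHE_V):
    """Potential in volts, relative to ref_v, for Ox + n e- -> Red."""
    dg_red = reduced.free_energy() - oxidized.free_energy()  # eV
    # With energies in eV, -dG / n is already in volts.
    return -dg_red / n_electrons - ref_v

# Hypothetical energies, chosen only to exercise the function.
ox = Species(e_sp_dft=-1000.0, g_therm_fp=1.20, dg_solv=-0.50)
red = Species(e_sp_dft=-1004.0, g_therm_fp=1.15, dg_solv=-0.90)
print(reduction_potential(ox, red))
```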
AI-Driven Carbon Monitoring: Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions.
EN: This study introduces a Spatiotemporal Vision Transformer with Wavelets (ST-ViWT) model to reconstruct continuous, high-resolution (0.25 degree) maps of atmospheric CO2 (XCO2) from OCO-2 satellite data over southern Canada, focusing on poultry regions. ST-ViWT fuses wavelet transformations with transformer attention using meteorological data, vegetation indices, topography, and land cover. The model demonstrates high accuracy (R2 = 0.984, RMSE = 0.468 ppm) and robust generalization against TCCON data (bias = -0.14 ppm, r = 0.928). Analysis reveals a positive correlation between poultry facility density and XCO2. The ST-ViWT framework offers improved CO2 mapping compared to traditional methods and supports the integration of satellite data with national inventories and precision livestock platforms for enhanced carbon accounting, hotspot identification, and mitigation assessment in agricultural landscapes.
VI: Nghiên cứu này giới thiệu mô hình Spatiotemporal Vision Transformer with Wavelets (ST-ViWT) để tái tạo bản đồ liên tục có độ phân giải cao (0,25 độ) về CO2 trong khí quyển (XCO2) từ dữ liệu vệ tinh OCO-2 trên miền nam Canada, tập trung vào các vùng chăn nuôi gia cầm. ST-ViWT kết hợp phép biến đổi wavelet với cơ chế attention transformer sử dụng dữ liệu khí tượng, chỉ số thực vật, địa hình và độ che phủ đất. Mô hình cho thấy độ chính xác cao (R2 = 0.984, RMSE = 0.468 ppm) và khả năng tổng quát hóa mạnh mẽ so với dữ liệu TCCON (độ lệch = -0.14 ppm, r = 0.928). Phân tích cho thấy mối tương quan dương giữa mật độ cơ sở chăn nuôi gia cầm và XCO2. Khung ST-ViWT cung cấp khả năng lập bản đồ CO2 được cải thiện so với các phương pháp truyền thống và hỗ trợ tích hợp dữ liệu vệ tinh với kiểm kê quốc gia và các nền tảng chăn nuôi chính xác để tăng cường hạch toán carbon, xác định điểm nóng và đánh giá giảm thiểu trong cảnh quan nông nghiệp.
Kinetic theory of emulsions with matter supply.
EN: This work presents a kinetic theory for emulsions with continuous matter supply, extending the Lifshitz-Slyozov-Wagner (LSW) theory. The study analyzes droplet growth under diffusion-limited and interface-resistance-limited conditions, considering both maintained supersaturation and constant matter supply scenarios. Key findings include decoupling and narrowing of the droplet size distribution with maintained supersaturation (diffusion-limited), a drifting distribution (interface-resistance-limited), and a transition between narrowing and broadening with constant matter supply (diffusion-limited). For interface-resistance-limited growth with constant matter supply, the authors find a universal coarsening law: the average radius follows a power law independent of the matter supply, and the droplet size distribution has a closed-form expression. The theory is relevant to biomolecular condensates in living cells.
VI: Nghiên cứu này trình bày một lý thuyết động học cho các nhũ tương có nguồn cung cấp vật chất liên tục, mở rộng lý thuyết Lifshitz-Slyozov-Wagner (LSW). Nghiên cứu phân tích sự phát triển của giọt trong điều kiện giới hạn bởi khuếch tán và giới hạn bởi trở kháng bề mặt phân cách, xem xét cả kịch bản duy trì độ quá bão hòa và kịch bản cung cấp vật chất không đổi. Các phát hiện chính bao gồm sự tách rời và thu hẹp phân bố kích thước giọt khi duy trì độ quá bão hòa (giới hạn bởi khuếch tán), phân bố trôi (giới hạn bởi trở kháng bề mặt phân cách) và sự chuyển đổi giữa thu hẹp và mở rộng khi cung cấp vật chất không đổi (giới hạn bởi khuếch tán). Đối với sự phát triển giới hạn bởi trở kháng bề mặt phân cách với nguồn cung cấp vật chất không đổi, các tác giả tìm ra một quy luật thô hóa phổ quát: bán kính trung bình tiến triển theo quy luật lũy thừa không phụ thuộc vào nguồn cung cấp vật chất, và phân bố kích thước giọt có biểu thức dạng đóng. Lý thuyết này có liên quan đến các thể ngưng tụ phân tử sinh học trong tế bào sống.
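
For background, the classical LSW results for a closed system (no matter supply) are the baseline that such a theory extends. The formulas below are those textbook results, not the new laws derived in the paper; the rate constants $K_D$ and $K_I$ are generic symbols introduced here for illustration.

```latex
% Diffusion-limited coarsening (Lifshitz-Slyozov): mean radius grows as t^{1/3}
\langle R \rangle^{3}(t) - \langle R \rangle^{3}(0) = K_D \, t

% Interface-limited coarsening (Wagner): mean radius grows as t^{1/2}
\langle R \rangle^{2}(t) - \langle R \rangle^{2}(0) = K_I \, t
```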
Assessing the Feasibility of Early Cancer Detection Using Routine Laboratory Data: An Evaluation of Machine Learning Approaches on an Imbalanced Dataset.
EN: This study investigates the feasibility of early cancer detection in dogs using routine lab data from the Golden Retriever Lifetime Study, addressing class imbalance and diverse cancer types. A total of 126 machine learning pipelines were tested, with the best model (Logistic Regression with class weighting and recursive feature elimination) achieving an AUROC of 0.815 but a low F1-score (0.25) and PPV (0.15). Despite a high NPV (0.98), the recall (0.79) is too low for clinical utility. SHAP analysis revealed non-specific features driving predictions. The study concludes that routine lab data alone are insufficient for reliable early cancer detection, highlighting the need for multi-modal data integration in veterinary oncology R&D.
VI: Nghiên cứu này điều tra tính khả thi của việc phát hiện sớm ung thư ở chó bằng cách sử dụng dữ liệu xét nghiệm thường quy từ Nghiên cứu Vòng đời Chó Golden Retriever, giải quyết vấn đề mất cân bằng lớp và nhiều loại ung thư khác nhau. Tổng cộng 126 quy trình machine learning đã được kiểm tra, với mô hình tốt nhất (Logistic Regression với trọng số lớp và loại bỏ đặc trưng đệ quy) đạt AUROC là 0.815 nhưng điểm F1 (0.25) và PPV (0.15) thấp. Mặc dù NPV cao (0.98), độ thu hồi (0.79) vẫn chưa đủ để ứng dụng lâm sàng. Phân tích SHAP cho thấy các đặc trưng không đặc hiệu thúc đẩy dự đoán. Nghiên cứu kết luận rằng chỉ dữ liệu xét nghiệm thường quy là không đủ để phát hiện sớm ung thư một cách đáng tin cậy, nhấn mạnh sự cần thiết của việc tích hợp dữ liệu đa phương thức trong R&D ung thư học thú y.
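
The reported metrics are mutually consistent, which a short calculation makes visible: F1 is the harmonic mean of precision (PPV) and recall, so a PPV of 0.15 caps the F1-score even with a recall of 0.79. The second function is an illustrative sketch of why PPV collapses on imbalanced data; the specificity and prevalence values passed to it are hypothetical and do not come from the study.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision (PPV) and recall."""
    return 2 * precision * recall / (precision + recall)

def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV from Bayes' rule, given test characteristics and class prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Values reported in the summary above: PPV = 0.15, recall = 0.79.
print(round(f1_from_precision_recall(0.15, 0.79), 2))  # 0.25

# Hypothetical specificity and prevalences: the same classifier yields a
# much lower PPV when the positive class is rare.
print(ppv_from_rates(0.79, 0.70, 0.50))
print(ppv_from_rates(0.79, 0.70, 0.05))
```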
Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine.
EN: This study introduces Ladder-base, a novel Large Language Model (LLM) for Traditional Chinese Medicine (TCM). It uses Group Relative Policy Optimization (GRPO) to improve reasoning and factual accuracy. Trained on the TCM-Ladder dataset using the Qwen2.5-7B-Instruct base model, Ladder-base outperforms both general-purpose (GPT-4, Gemini 2.5, Claude 3, Qwen3) and TCM-specific LLMs (BenTsao, HuatuoGPT2, Zhongjing) across various reasoning metrics. The findings demonstrate GRPO's effectiveness in aligning LLMs with expert-level reasoning in TCM, paving the way for reliable TCM AI applications.
VI: Nghiên cứu này giới thiệu Ladder-base, một Mô hình Ngôn ngữ Lớn (LLM) mới cho Y học Cổ truyền Trung Quốc (TCM). Nó sử dụng Tối ưu hóa Chính sách Tương đối theo Nhóm (GRPO) để cải thiện khả năng suy luận và tính chính xác thực tế. Được huấn luyện trên tập dữ liệu TCM-Ladder sử dụng mô hình nền tảng Qwen2.5-7B-Instruct, Ladder-base vượt trội hơn cả các LLM đa mục đích (GPT-4, Gemini 2.5, Claude 3, Qwen3) và LLM chuyên dụng cho TCM (BenTsao, HuatuoGPT2, Zhongjing) trên nhiều chỉ số suy luận khác nhau. Các phát hiện chứng minh tính hiệu quả của GRPO trong việc điều chỉnh LLM phù hợp với khả năng suy luận ở cấp độ chuyên gia trong TCM, mở đường cho các ứng dụng AI TCM đáng tin cậy.
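
The core idea of GRPO, scoring each sampled answer relative to the other answers drawn for the same prompt rather than against a learned value function, can be sketched in a few lines. This shows only the group-relative advantage computation, not the full training loop used for Ladder-base; the function name, the `eps` constant, and the example rewards are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled answer's reward against its own group.

    rewards: scores for several answers sampled for the same prompt.
    eps: small hypothetical constant that avoids division by zero when
    all rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up rewards for four sampled answers to one question.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers that score above the group mean receive positive advantages and are reinforced; those below the mean are pushed down.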
Multi-Marginal Schrödinger Bridge Matching.
EN: This paper introduces Multi-Marginal Schrödinger Bridge Matching (MSBM), a new algorithm that extends iterative Markovian fitting (IMF) to solve the multi-marginal Schrödinger Bridge problem. MSBM infers continuous population dynamics from multiple discrete snapshots by robustly enforcing all intermediate marginal distributions. Experiments on synthetic and single-cell RNA sequencing data demonstrate MSBM's competitive or superior performance in capturing complex trajectories with improved computational efficiency, enabling a more accurate understanding of dynamic processes in areas like developmental biology and systems medicine.
VI: Bài báo này giới thiệu phương pháp Multi-Marginal Schrödinger Bridge Matching (MSBM), một thuật toán mới mở rộng kỹ thuật iterative Markovian fitting (IMF) để giải quyết bài toán Schrödinger Bridge với nhiều phân phối biên. MSBM suy luận động lực học quần thể liên tục từ nhiều ảnh chụp rời rạc bằng cách ràng buộc chặt chẽ tất cả các phân phối biên trung gian. Các thử nghiệm trên dữ liệu tổng hợp và dữ liệu giải trình tự RNA tế bào đơn cho thấy MSBM có hiệu suất cạnh tranh hoặc vượt trội trong việc nắm bắt các quỹ đạo phức tạp với hiệu quả tính toán được cải thiện, cho phép hiểu chính xác hơn về các quá trình động trong các lĩnh vực như sinh học phát triển và y học hệ thống.
Poultry Farm Intelligence: An Integrated Multi-Sensor AI Platform for Enhanced Welfare and Productivity.
EN: Poultry Farm Intelligence (PoultryFI) is a new, modular AI platform designed for small to medium-sized poultry farms to improve animal welfare and productivity. It integrates six AI modules for camera placement optimization, audio-visual monitoring of welfare indicators, real-time egg counting (achieving 100% accuracy on Raspberry Pi 5), production and profitability forecasting, and a recommendation system based on forecasts and weather data. Field trials showed robust anomaly detection and reliable short-term forecasting. PoultryFI offers a cost-effective solution for continuous monitoring and proactive decision-making, bridging the gap between isolated research tools and scalable farm-wide intelligence, thus enhancing both welfare and profitability.
VI: Poultry Farm Intelligence (PoultryFI) là một nền tảng AI dạng mô-đun mới, được thiết kế cho các trang trại gia cầm quy mô vừa và nhỏ để cải thiện phúc lợi động vật và năng suất. Nó tích hợp sáu mô-đun AI để tối ưu hóa vị trí camera, giám sát trực quan-âm thanh các chỉ số phúc lợi, đếm trứng theo thời gian thực (đạt độ chính xác 100% trên Raspberry Pi 5), dự báo sản xuất và lợi nhuận, và một hệ thống khuyến nghị dựa trên dự báo và dữ liệu thời tiết. Thử nghiệm thực tế cho thấy khả năng phát hiện dị thường mạnh mẽ và dự báo ngắn hạn đáng tin cậy. PoultryFI cung cấp một giải pháp hiệu quả về chi phí để giám sát liên tục và đưa ra quyết định chủ động, thu hẹp khoảng cách giữa các công cụ nghiên cứu riêng lẻ và trí tuệ toàn trang trại có khả năng mở rộng, do đó nâng cao cả phúc lợi và lợi nhuận.
Spatiotemporal Transformers for Predicting Avian Disease Risk from Migration Trajectories.
EN: This study introduces a Transformer-based model to predict avian disease risk at the destinations of migrating birds. It uses GPS tracking data, disease outbreak records, and geospatial data, processed with H3 encoding. The model predicts endpoint disease risk with high accuracy (0.9821), AUC (0.9803), AP (0.9299), and F1-score (0.8836). This suggests its utility in early warning systems for avian disease, supporting timely interventions.
VI: Nghiên cứu này giới thiệu mô hình dựa trên kiến trúc Transformer để dự đoán nguy cơ dịch bệnh gia cầm tại điểm đến của các loài chim di cư. Mô hình sử dụng dữ liệu theo dõi GPS, hồ sơ dịch bệnh và dữ liệu không gian địa lý, được xử lý bằng mã hóa H3. Mô hình dự đoán nguy cơ dịch bệnh tại điểm đến với độ chính xác cao (0.9821), AUC (0.9803), AP (0.9299) và F1-score (0.8836). Điều này cho thấy tiềm năng ứng dụng của mô hình trong các hệ thống cảnh báo sớm về dịch bệnh gia cầm, hỗ trợ các biện pháp can thiệp kịp thời.
Augmenting generative models with biomedical knowledge graphs improves targeted drug discovery.
EN: K-DREAM, a novel framework, augments diffusion-based generative models with biomedical knowledge graphs to improve targeted drug discovery. By embedding knowledge graph information, K-DREAM directs molecular generation toward candidates with improved binding affinities, predicted efficacy, and biological relevance. Experiments show K-DREAM outperforms existing models in generating drug candidates for specific and multiple targets, highlighting its potential for rational drug design and therapeutic development.
VI: K-DREAM, một khung làm việc mới, tăng cường các mô hình sinh dựa trên khuếch tán bằng đồ thị tri thức y sinh để cải thiện việc khám phá thuốc hướng đích. Bằng cách nhúng thông tin từ đồ thị tri thức, K-DREAM hướng việc tạo phân tử tới các ứng cử viên có ái lực liên kết được cải thiện, hiệu quả dự đoán và tính liên quan sinh học. Các thử nghiệm cho thấy K-DREAM vượt trội hơn các mô hình hiện có trong việc tạo ra các ứng cử viên thuốc cho các mục tiêu cụ thể và đa mục tiêu, làm nổi bật tiềm năng của nó trong thiết kế thuốc hợp lý và phát triển liệu pháp.
Domain Knowledge Infused Conditional Generative Models for Accelerating Drug Discovery.
EN: This research addresses data sparsity issues in drug discovery AI, where limited overlap between pharmacokinetic (PK) and Drug-Target Interaction (DTI) datasets hinders progress. The authors propose xImagand-DKI, a diffusion model that generates PK and DTI properties from SMILES and protein inputs, even with sparse data. Key methods include infusing domain knowledge from Gene Ontology (GO) and molecular fingerprints to improve model performance. Findings demonstrate that xImagand-DKI generates synthetic PK data closely resembling real data and effectively fills gaps between datasets. This model offers a promising solution to data sparsity, potentially improving downstream tasks in drug discovery.
VI: Nghiên cứu này giải quyết vấn đề dữ liệu thưa thớt trong AI ứng dụng vào phát hiện thuốc, cụ thể là sự chồng chéo hạn chế giữa các tập dữ liệu dược động học (PK) và tương tác thuốc-mục tiêu (DTI), cản trở sự tiến bộ. Các tác giả đề xuất xImagand-DKI, một mô hình khuếch tán tạo ra các đặc tính PK và DTI từ đầu vào SMILES và protein, ngay cả khi dữ liệu thưa thớt. Các phương pháp chính bao gồm đưa kiến thức miền từ Gene Ontology (GO) và dấu vân tay phân tử để cải thiện hiệu suất của mô hình. Kết quả cho thấy xImagand-DKI tạo ra dữ liệu PK tổng hợp gần giống với dữ liệu thực và lấp đầy hiệu quả các khoảng trống giữa các tập dữ liệu. Mô hình này cung cấp một giải pháp đầy hứa hẹn cho vấn đề dữ liệu thưa thớt, có khả năng cải thiện các tác vụ hạ nguồn trong quá trình khám phá thuốc.
MagicDock: Toward Docking-oriented De Novo Ligand Design via Gradient Inversion.
EN: MagicDock is a novel framework for de novo ligand design that overcomes limitations of existing methods by using gradient inversion and differentiable surface modeling. It incorporates general docking knowledge into a backbone model and uses reverse gradient flows guided by binding prediction to generate ligands. Differentiable surface modeling, leveraging 3D point-cloud representations, ensures docking validity. The framework handles various ligand types and shows significant improvements (27.1% and 11.7% on average) over state-of-the-art baselines in experiments across 9 scenarios, demonstrating potential for biomedical applications.
VI: MagicDock là một khung làm việc mới cho việc thiết kế phối tử de novo (từ đầu), khắc phục các hạn chế của các phương pháp hiện có bằng cách sử dụng đảo ngược gradient và mô hình hóa bề mặt khả vi. Nó tích hợp kiến thức chung về docking vào một mô hình xương sống và sử dụng các luồng gradient ngược được hướng dẫn bởi dự đoán liên kết để tạo ra các phối tử. Mô hình hóa bề mặt khả vi, tận dụng biểu diễn đám mây điểm 3D, đảm bảo tính hợp lệ của docking. Khung làm việc này xử lý nhiều loại phối tử khác nhau và cho thấy sự cải thiện đáng kể (trung bình 27,1% và 11,7%) so với các đường cơ sở hiện đại trong các thí nghiệm trên 9 kịch bản, chứng minh tiềm năng cho các ứng dụng y sinh.
Denoised Diffusion for Object-Focused Image Augmentation.
EN: This research introduces a data augmentation framework for animal health monitoring using drone imagery, addressing data scarcity and farm-specific variations. It uses segmentation to isolate animals and then employs denoised diffusion models to synthesize realistic augmented images. Experiments show the augmented dataset improves animal detection performance compared to baseline models, enabling better real-time monitoring in data-limited situations.
VI: Nghiên cứu này giới thiệu một khung phương pháp tăng cường dữ liệu cho việc theo dõi sức khỏe động vật bằng hình ảnh từ máy bay không người lái, giải quyết vấn đề thiếu dữ liệu và sự khác biệt giữa các trang trại. Nó sử dụng phân đoạn để cô lập động vật và sau đó sử dụng mô hình khuếch tán khử nhiễu để tổng hợp các hình ảnh tăng cường thực tế. Các thử nghiệm cho thấy bộ dữ liệu tăng cường cải thiện hiệu suất phát hiện động vật so với các mô hình cơ sở, cho phép theo dõi thời gian thực tốt hơn trong các tình huống dữ liệu hạn chế.
A Hybrid Computational Intelligence Framework with Metaheuristic Optimization for Drug-Drug Interaction Prediction.
EN: This study proposes a hybrid computational intelligence framework for predicting drug-drug interactions (DDIs). The framework combines molecular embeddings (Mol2Vec and SMILES-BERT) with a rule-based clinical score (RBScore) to enhance DDI prediction. A neural classifier is then optimized using a novel three-stage metaheuristic algorithm (RSmpl-ACO-PSO). The model achieves high accuracy on real-world datasets (ROC-AUC 0.911, PR-AUC 0.867 on DrugBank) and demonstrates good generalization. The study highlights the individual contributions of embedding fusion, RBScore, and the optimizer to model performance, suggesting a practical approach for building reliable and interpretable DDI prediction models to support safer drug therapies and clinical decisions.
VI: Nghiên cứu này đề xuất một khung trí tuệ tính toán kết hợp để dự đoán tương tác thuốc-thuốc (DDIs). Khung này kết hợp các biểu diễn phân tử (Mol2Vec và SMILES-BERT) với điểm số lâm sàng dựa trên quy tắc (RBScore) để tăng cường khả năng dự đoán DDI. Một bộ phân loại thần kinh sau đó được tối ưu hóa bằng thuật toán metaheuristic ba giai đoạn mới (RSmpl-ACO-PSO). Mô hình đạt được độ chính xác cao trên các bộ dữ liệu thực tế (ROC-AUC 0.911, PR-AUC 0.867 trên DrugBank) và thể hiện khả năng khái quát hóa tốt. Nghiên cứu nhấn mạnh sự đóng góp riêng lẻ của việc kết hợp biểu diễn, RBScore và bộ tối ưu hóa vào hiệu suất của mô hình, gợi ý một phương pháp thực tế để xây dựng các mô hình dự đoán DDI đáng tin cậy và dễ diễn giải, hỗ trợ các liệu pháp dùng thuốc an toàn hơn và các quyết định lâm sàng.
Fitzpatrick Thresholding for Skin Image Segmentation.
EN: This study addresses the issue of skin image segmentation models performing poorly on darker skin tones (Fitzpatrick VI) in psoriasis rash analysis, which undermines accurate body surface area (BSA) estimation. The researchers created a labeled psoriasis dataset with Fitzpatrick skin type annotations and segmentation masks. They trained U-Net, ResU-Net, and SETR-small models and then optimized decision thresholds, globally and per Fitzpatrick type, to improve segmentation accuracy, particularly for Fitzpatrick VI skin. Fitzpatrick-specific thresholding significantly improved performance on the darkest skin tone, with increases of up to 31% in binary IoU and 24% in Dice score for U-Net. Given the high accuracy of Fitzpatrick skin tone classifiers, the method is simple, cost-effective, model-agnostic, and requires no retraining. It is proposed as a fairness baseline for future research.
VI: Nghiên cứu này giải quyết vấn đề các mô hình phân vùng ảnh da hoạt động kém trên các tông da tối màu (Fitzpatrick VI) trong phân tích phát ban vẩy nến, ảnh hưởng đến việc ước tính chính xác diện tích bề mặt cơ thể (BSA). Các nhà nghiên cứu đã tạo ra một tập dữ liệu vẩy nến được gắn nhãn với chú thích loại da Fitzpatrick và mặt nạ phân vùng. Họ đã huấn luyện các mô hình U-Net, ResU-Net và SETR-small, sau đó tối ưu hóa ngưỡng quyết định, chung cho toàn bộ dữ liệu và theo từng loại Fitzpatrick, để cải thiện độ chính xác của phân vùng, đặc biệt đối với da Fitzpatrick VI. Việc đặt ngưỡng riêng cho từng loại Fitzpatrick đã cải thiện đáng kể hiệu suất trên tông da tối nhất, với mức tăng lên đến 31% về IoU nhị phân và 24% về điểm Dice cho U-Net. Với độ chính xác cao của các bộ phân loại tông màu da Fitzpatrick, phương pháp này đơn giản, hiệu quả về chi phí, không phụ thuộc vào mô hình và không yêu cầu huấn luyện lại. Nó được đề xuất như một mức cơ sở về tính công bằng cho nghiên cứu trong tương lai.
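
Per-group thresholding of this kind needs no retraining: the model's probability maps stay fixed, and only the cut-off applied to them changes. The sketch below picks, for each group, the candidate threshold with the highest mean binary IoU on held-out samples. It is an illustrative reading of the idea, not the authors' code; the function names, the candidate grid, and the toy data are assumptions.

```python
def binary_iou(probs, mask, threshold):
    """IoU between a thresholded probability map and a ground-truth mask."""
    pred = [p >= threshold for p in probs]
    inter = sum(1 for p, m in zip(pred, mask) if p and m)
    union = sum(1 for p, m in zip(pred, mask) if p or m)
    return inter / union if union else 1.0

def best_threshold(samples, candidates):
    """samples: list of (probs, mask) pairs; returns the best candidate."""
    def mean_iou(t):
        return sum(binary_iou(p, m, t) for p, m in samples) / len(samples)
    # max() keeps the first candidate on ties, i.e. the lowest threshold.
    return max(candidates, key=mean_iou)

def thresholds_by_group(samples_by_group, candidates):
    """One threshold per group label (here: Fitzpatrick type)."""
    return {g: best_threshold(s, candidates) for g, s in samples_by_group.items()}

# Toy flattened "images": the model is less confident on the second group.
data = {
    "I": [([0.9, 0.8, 0.4, 0.1], [1, 1, 0, 0])],
    "VI": [([0.45, 0.40, 0.15, 0.10], [1, 1, 0, 0])],
}
print(thresholds_by_group(data, [0.3, 0.5, 0.7]))  # {'I': 0.5, 'VI': 0.3}
```

At inference time, a skin tone classifier (or a manual annotation) would select which stored threshold to apply to a given image.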
Physics-Informed Machine Learning in Biomedical Science and Engineering.
EN: Physics-informed machine learning (PIML) is emerging as a potentially transformative paradigm for modeling complex biomedical systems by integrating parameterized physical laws with data-driven methods. Here, we review three main classes of PIML frameworks: physics-informed neural networks (PINNs), neural ordinary differential equations (NODEs), and neural operators (NOs), highlighting their growing role in biomedical science and engineering. We begin with PINNs, which embed governing equations into deep learning models and have been successfully applied to biosolid and biofluid mechanics, mechanobiology, and medical imaging among other areas. We then review NODEs, which offer continuous-time modeling, especially suited to dynamic physiological systems, pharmacokinetics, and cell signaling. Finally, we discuss deep NOs as powerful tools for learning mappings between function spaces, enabling efficient simulations across multiscale and spatially heterogeneous biological domains. Throughout, we emphasize applications where physical interpretability, data scarcity, or system complexity make conventional black-box learning insufficient. We conclude by identifying open challenges and fu...
VI: Học máy có tích hợp vật lý (PIML) đang nổi lên như một mô hình có khả năng thay đổi cách thức mô hình hóa các hệ thống y sinh phức tạp bằng cách tích hợp các định luật vật lý được tham số hóa với các phương pháp dựa trên dữ liệu. Ở đây, chúng tôi đánh giá ba loại khung PIML chính: mạng thần kinh tích hợp vật lý (PINN), phương trình vi phân thường thần kinh (NODE) và toán tử thần kinh (NO), làm nổi bật vai trò ngày càng tăng của chúng trong khoa học và kỹ thuật y sinh. Chúng tôi bắt đầu với PINN, nhúng các phương trình điều khiển vào các mô hình học sâu và đã được áp dụng thành công cho cơ học chất rắn sinh học và chất lỏng sinh học, cơ sinh học và hình ảnh y tế, cùng nhiều lĩnh vực khác. Sau đó, chúng tôi đánh giá NODE, cung cấp mô hình hóa thời gian liên tục, đặc biệt phù hợp với các hệ thống sinh lý động, dược động học và tín hiệu tế bào. Cuối cùng, chúng tôi thảo luận về NO sâu như một công cụ mạnh mẽ để học các ánh xạ giữa các không gian hàm, cho phép mô phỏng hiệu quả trên các miền sinh học đa tỷ lệ và không đồng nhất về mặt không gian. Xuyên suốt, chúng tôi nhấn mạnh các ứng dụng mà tính giải thích vật lý, sự khan hiếm dữ liệu hoặc độ phức tạp của hệ thống khiến việc học hộp đen thông thường là không đủ. Chúng tôi kết luận bằng cách xác định các thách thức còn tồn tại và các hướng đi trong tương lai để thúc đẩy PIML trong khoa học và kỹ thuật y sinh, bao gồm các vấn đề về định lượng độ không chắc chắn, khái quát hóa và tích hợp PIML và các mô hình ngôn ngữ lớn.
Forecasting-Based Biomedical Time-series Data Synthesis for Open Data and Robust AI.
EN: The limited data availability due to strict privacy regulations and significant resource demands severely constrains biomedical time-series AI development, which creates a critical gap between data requirements and accessibility. Synthetic data generation presents a promising solution by producing artificial datasets that maintain the statistical properties of real biomedical time-series data without compromising patient confidentiality. We propose a framework for synthetic biomedical time-series data generation based on advanced forecasting models that accurately replicates complex electrophysiological signals such as EEG and EMG with high fidelity. These synthetic datasets preserve essential temporal and spectral properties of real data, which enables robust analysis while effectively addressing data scarcity and privacy challenges. Our evaluations across multiple subjects demonstrate that the generated synthetic data can serve as an effective substitute for real data and also significantly boost AI model performance. The approach maintains critical biomedical features while providing high scalability for various applications and integrates seamlessly into open-source repositories...
VI: Sự sẵn có dữ liệu hạn chế do các quy định nghiêm ngặt về quyền riêng tư và nhu cầu tài nguyên đáng kể đang cản trở nghiêm trọng sự phát triển AI dựa trên chuỗi thời gian y sinh, tạo ra một khoảng cách lớn giữa yêu cầu dữ liệu và khả năng tiếp cận. Tạo dữ liệu tổng hợp đưa ra một giải pháp đầy hứa hẹn bằng cách tạo ra các bộ dữ liệu nhân tạo duy trì các thuộc tính thống kê của dữ liệu chuỗi thời gian y sinh thực tế mà không ảnh hưởng đến tính bảo mật của bệnh nhân. Chúng tôi đề xuất một khuôn khổ để tạo dữ liệu chuỗi thời gian y sinh tổng hợp dựa trên các mô hình dự báo tiên tiến, tái tạo chính xác các tín hiệu điện sinh lý phức tạp như EEG và EMG với độ trung thực cao. Các bộ dữ liệu tổng hợp này bảo toàn các thuộc tính thời gian và phổ quan trọng của dữ liệu thực, cho phép phân tích mạnh mẽ đồng thời giải quyết hiệu quả các thách thức về khan hiếm dữ liệu và quyền riêng tư. Các đánh giá của chúng tôi trên nhiều đối tượng chứng minh rằng dữ liệu tổng hợp được tạo ra có thể đóng vai trò là một sự thay thế hiệu quả cho dữ liệu thực và cũng tăng cường đáng kể hiệu suất của mô hình AI. Cách tiếp cận này duy trì các đặc điểm y sinh quan trọng đồng thời cung cấp khả năng mở rộng cao cho các ứng dụng khác nhau và tích hợp liền mạch vào các kho lưu trữ mã nguồn mở, mở rộng đáng kể nguồn lực cho nghiên cứu y sinh dựa trên AI.
Ensemble Deep Learning and LLM-Assisted Reporting for Automated Skin Lesion Diagnosis.
EN: Cutaneous malignancies demand early detection for favorable outcomes, yet current diagnostics suffer from inter-observer variability and access disparities. While AI shows promise, existing dermatological systems are limited by homogeneous architectures, dataset biases across skin tones, and fragmented approaches that treat natural language processing as separate post-hoc explanations rather than integral to clinical decision-making. We introduce a unified framework that fundamentally reimagines AI integration for dermatological diagnostics through two synergistic innovations. First, a purposefully heterogeneous ensemble of architecturally diverse convolutional neural networks provides complementary diagnostic perspectives, with an intrinsic uncertainty mechanism flagging discordant cases for specialist review -- mimicking clinical best practices. Second, we embed large language model capabilities directly into the diagnostic workflow, transforming classification outputs into clinically meaningful assessments that simultaneously fulfill medical documentation requirements and deliver patient-centered education. This seamless integration generates structured reports featuring precise...
VI: Các khối u ác tính ở da đòi hỏi phải được phát hiện sớm để có kết quả điều trị tốt, tuy nhiên, các phương pháp chẩn đoán hiện tại còn gặp phải sự khác biệt giữa các người quan sát và sự bất bình đẳng trong tiếp cận. Mặc dù AI đầy hứa hẹn, nhưng các hệ thống da liễu hiện tại bị hạn chế bởi các kiến trúc đồng nhất, sự thiên vị dữ liệu trên các tông màu da khác nhau và các phương pháp tiếp cận rời rạc, coi xử lý ngôn ngữ tự nhiên như là các giải thích hậu kỳ riêng biệt thay vì là một phần không thể thiếu trong việc ra quyết định lâm sàng. Chúng tôi giới thiệu một khung tích hợp thống nhất, tái hình dung lại một cách cơ bản việc tích hợp AI cho chẩn đoán da liễu thông qua hai cải tiến hiệp đồng. Thứ nhất, một tập hợp đa dạng một cách có chủ đích các mạng nơ-ron tích chập khác nhau về kiến trúc cung cấp các góc nhìn chẩn đoán bổ sung, với một cơ chế không chắc chắn nội tại gắn cờ các trường hợp bất đồng để chuyên gia xem xét lại - mô phỏng các thực hành lâm sàng tốt nhất. Thứ hai, chúng tôi nhúng trực tiếp các khả năng của mô hình ngôn ngữ lớn vào quy trình làm việc chẩn đoán, chuyển đổi các đầu ra phân loại thành các đánh giá có ý nghĩa lâm sàng, đồng thời đáp ứng các yêu cầu về tài liệu y tế và cung cấp giáo dục hướng đến bệnh nhân. Sự tích hợp liền mạch này tạo ra các báo cáo có cấu trúc, đặc trưng chính xác tổn thương, lý luận chẩn đoán dễ hiểu và hướng dẫn theo dõi có thể hành động - trao quyền cho bệnh nhân nhận biết các dấu hiệu cảnh báo sớm giữa các lần khám. Bằng cách giải quyết cả độ tin cậy chẩn đoán và các rào cản giao tiếp trong một hệ thống gắn kết duy nhất, cách tiếp cận của chúng tôi thu hẹp khoảng cách chuyển đổi quan trọng đã ngăn cản các triển khai AI trước đây đạt được tác động lâm sàng. Khung này đại diện cho một tiến bộ đáng kể hướng tới AI da liễu có thể triển khai, giúp nâng cao độ chính xác chẩn đoán đồng thời tích cực hỗ trợ sự liên tục của quá trình chăm sóc từ phát hiện ban đầu đến giáo dục bệnh nhân, cuối cùng cải thiện tỷ lệ can thiệp sớm cho các tổn thương da.
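The idea of an ensemble that flags discordant cases for specialist review can be sketched roughly as follows. This is only one plausible reading of the abstract: the function name `ensemble_predict`, the vote-agreement rule, and the 0.75 threshold are assumptions for illustration, not the paper's actual uncertainty mechanism.

```python
def ensemble_predict(prob_vectors, agreement_threshold=0.75):
    """Average per-model class probabilities and flag low-agreement cases.

    prob_vectors: one probability list per model, all over the same classes.
    Returns (predicted_class_index, mean_probabilities, needs_review).
    """
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    # Soft voting: average the probability each model assigns to each class.
    mean_probs = [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: mean_probs[c])
    # Each model's own top class, used to measure how many models agree.
    votes = [max(range(n_classes), key=lambda c: p[c]) for p in prob_vectors]
    agreement = votes.count(predicted) / n_models
    return predicted, mean_probs, agreement < agreement_threshold

# Three hypothetical models, two classes (index 0 = benign, 1 = malignant).
concordant = [[0.90, 0.10], [0.80, 0.20], [0.85, 0.15]]
discordant = [[0.90, 0.10], [0.30, 0.70], [0.45, 0.55]]

concordant_result = ensemble_predict(concordant)
discordant_result = ensemble_predict(discordant)
```

In the discordant example the averaged probabilities still favor class 0, yet two of the three models vote for class 1, so the case is routed to review instead of being reported with unwarranted confidence.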
Predictive Modeling and Explainable AI for Veterinary Safety Profiles, Residue Assessment, and Health Outcomes Using Real-World Data and Physicochemical Properties.
EN: The safe use of pharmaceuticals in food-producing animals is vital to protect animal welfare and human food safety. Adverse events (AEs) may signal unexpected pharmacokinetic or toxicokinetic effects, increasing the risk of violative residues in the food chain. This study introduces a predictive framework for classifying outcomes (Death vs. Recovery) using ~1.28 million reports (1987-2025 Q1) from the U.S. FDA's OpenFDA Center for Veterinary Medicine. A preprocessing pipeline merged relational tables and standardized AEs through VeDDRA ontologies. Data were normalized, missing values imputed, and high-cardinality features reduced; physicochemical drug properties were integrated to capture chemical-residue links. We evaluated supervised models, including Random Forest, CatBoost, XGBoost, ExcelFormer, and large language models (Gemma 3-27B, Phi 3-12B). Class imbalance was addressed with techniques such as undersampling and oversampling, with a focus on prioritizing recall for fatal outcomes. Ensemble methods (Voting, Stacking) and CatBoost performed best, achieving precision, recall, and F1-scores of 0.95. Incorporating Average Uncertainty Margin (AUM)-based pseudo-labeling of uncertain cases improv...
VI: Việc sử dụng dược phẩm an toàn ở động vật sản xuất thực phẩm là rất quan trọng để bảo vệ phúc lợi động vật và an toàn thực phẩm cho người. Các biến cố bất lợi (AE) có thể báo hiệu các tác động dược động học hoặc độc tính động học không mong muốn, làm tăng nguy cơ tồn dư vượt ngưỡng trong chuỗi thực phẩm. Nghiên cứu này giới thiệu một khung dự đoán để phân loại kết quả (Tử vong so với Phục hồi) sử dụng ~1,28 triệu báo cáo (1987-2025 Quý 1) từ Trung tâm Y học Thú y OpenFDA của FDA Hoa Kỳ. Một quy trình tiền xử lý đã hợp nhất các bảng quan hệ và chuẩn hóa các AE thông qua các hệ thống phân loại VeDDRA. Dữ liệu đã được chuẩn hóa, các giá trị bị thiếu được điền vào và các đặc trưng có độ phức tạp cao được giảm bớt; các thuộc tính lý hóa của thuốc đã được tích hợp để nắm bắt các liên kết hóa chất-tồn dư. Chúng tôi đã đánh giá các mô hình học có giám sát, bao gồm Random Forest, CatBoost, XGBoost, ExcelFormer và các mô hình ngôn ngữ lớn (Gemma 3-27B, Phi 3-12B). Sự mất cân bằng lớp đã được giải quyết, chẳng hạn như lấy mẫu dưới và lấy mẫu trên, tập trung vào việc ưu tiên độ nhạy cho các kết quả tử vong. Các phương pháp kết hợp (Voting, Stacking) và CatBoost hoạt động tốt nhất, đạt được độ chính xác, độ nhạy và điểm F1 là 0,95. Việc kết hợp gán nhãn giả dựa trên Average Uncertainty Margin (AUM) cho các trường hợp không chắc chắn đã cải thiện khả năng phát hiện lớp thiểu số, đặc biệt là trong ExcelFormer và XGBoost. Tính giải thích thông qua SHAP đã xác định các yếu tố dự đoán hợp lý về mặt sinh học, bao gồm rối loạn phổi, tim và phế quản, nhân khẩu học động vật và các thuộc tính lý hóa của thuốc. Những đặc trưng này có liên quan chặt chẽ đến các kết quả tử vong. Nhìn chung, khung này cho thấy rằng việc kết hợp kỹ thuật dữ liệu chặt chẽ, máy học tiên tiến và AI có thể giải thích được cho phép dự đoán chính xác, có thể giải thích được về các kết quả an toàn thú y. Cách tiếp cận này hỗ trợ sứ mệnh của FARAD bằng cách cho phép phát hiện sớm các hồ sơ sự kiện-thuốc có rủi ro cao, tăng cường đánh giá rủi ro tồn dư và cung cấp thông tin cho việc ra quyết định pháp lý và lâm sàng.
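To make the class-imbalance handling and the reported metrics concrete, here is a minimal sketch of random oversampling and of precision, recall, and F1 for the fatal class. The helper names and the toy labels are hypothetical; the study's own resampling code and data are not shown in the abstract, and in practice oversampling would be applied to the training split only.

```python
import random

def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both groups are the same size."""
    rng = random.Random(seed)
    minority = [(r, y) for r, y in zip(rows, labels) if y == minority_label]
    majority = [(r, y) for r, y in zip(rows, labels) if y != minority_label]
    if not minority:
        raise ValueError("no rows carry the minority label")
    # range() of a negative number is empty, so balanced data is left unchanged.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    return [r for r, _ in combined], [y for _, y in combined]

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: 8 recoveries and 2 deaths before balancing.
toy_rows = list(range(10))
toy_labels = ["Recovery"] * 8 + ["Death"] * 2
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels, "Death")

metrics = precision_recall_f1(
    ["Death", "Death", "Recovery", "Recovery"],
    ["Death", "Recovery", "Recovery", "Death"],
    positive="Death",
)
```

Prioritizing recall for fatal outcomes means accepting more false alarms (lower precision) in exchange for fewer missed deaths (a smaller `fn` count).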
AQUAIR: A High-Resolution Indoor Environmental Quality Dataset for Smart Aquaculture Monitoring.
EN: Smart aquaculture systems depend on rich environmental data streams to protect fish welfare, optimize feeding, and reduce energy use. Yet public datasets that describe the air surrounding indoor tanks remain scarce, limiting the development of forecasting and anomaly-detection tools that couple head-space conditions with water-quality dynamics. We therefore introduce AQUAIR, an open-access public dataset that logs six Indoor Environmental Quality (IEQ) variables--air temperature, relative humidity, carbon dioxide, total volatile organic compounds, PM2.5 and PM10--inside a fish aquaculture facility in Amghass, Azrou, Morocco. A single Awair HOME monitor sampled every five minutes from 14 October 2024 to 9 January 2025, producing more than 23,000 time-stamped observations that are fully quality-controlled and publicly archived on Figshare. We describe the sensor placement, ISO-compliant mounting height, calibration checks against reference instruments, and an open-source processing pipeline that normalizes timestamps, interpolates short gaps, and exports analysis-ready tables. Exploratory statistics show stable conditions (median CO2 = 758 ppm; PM2.5 = 12 micrograms/m3) with pronounc...
VI: Các hệ thống nuôi trồng thủy sản thông minh phụ thuộc vào các luồng dữ liệu môi trường phong phú để bảo vệ phúc lợi của cá, tối ưu hóa việc cho ăn và giảm sử dụng năng lượng. Tuy nhiên, các bộ dữ liệu công khai mô tả không khí xung quanh các bể trong nhà vẫn còn khan hiếm, hạn chế sự phát triển của các công cụ dự báo và phát hiện bất thường kết hợp các điều kiện không gian đầu với động lực học chất lượng nước. Do đó, chúng tôi giới thiệu AQUAIR, một bộ dữ liệu công khai truy cập mở ghi lại sáu biến Chất lượng Môi trường Trong nhà (IEQ) - nhiệt độ không khí, độ ẩm tương đối, carbon dioxide, tổng hợp chất hữu cơ dễ bay hơi, PM2.5 và PM10 - bên trong một cơ sở nuôi trồng thủy sản ở Amghass, Azrou, Morocco. Một thiết bị giám sát Awair HOME duy nhất lấy mẫu cứ sau năm phút từ ngày 14 tháng 10 năm 2024 đến ngày 9 tháng 1 năm 2025, tạo ra hơn 23.000 quan sát được đóng dấu thời gian, được kiểm soát chất lượng đầy đủ và lưu trữ công khai trên Figshare. Chúng tôi mô tả vị trí đặt cảm biến, chiều cao lắp đặt tuân thủ ISO, kiểm tra hiệu chuẩn so với các thiết bị tham chiếu và một quy trình xử lý mã nguồn mở chuẩn hóa dấu thời gian, nội suy các khoảng trống ngắn và xuất các bảng sẵn sàng phân tích. Thống kê thăm dò cho thấy các điều kiện ổn định (CO2 trung vị = 758 ppm; PM2.5 = 12 microgam/m3) với các đỉnh điểm rõ rệt vào thời gian cho ăn, cung cấp cấu trúc phong phú cho dự báo ngắn hạn, phát hiện sự kiện và nghiên cứu độ trôi của cảm biến. Do đó, AQUAIR lấp đầy một khoảng trống quan trọng trong tin học nuôi trồng thủy sản thông minh và cung cấp một chuẩn mực có thể tái tạo cho chương trình giảng dạy về máy học tập trung vào dữ liệu và nghiên cứu cảm biến môi trường tập trung vào động lực học không gian đầu trong các hệ thống nuôi trồng thủy sản tuần hoàn.
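The processing steps mentioned above (normalizing timestamps and interpolating short gaps) might look roughly like the sketch below. It is a hypothetical stand-in, not the dataset's published pipeline: the function name `regularize`, the nearest-grid-point snapping rule, and the two-sample gap limit are all assumptions.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def regularize(readings, step_minutes=5, max_gap_steps=2):
    """Snap readings to a fixed grid and linearly fill short gaps.

    readings: list of (naive datetime, float) pairs in any order.
    Returns a list of (datetime, value or None); runs of more than
    max_gap_steps missing samples are left as None.
    """
    step = timedelta(minutes=step_minutes)
    step_seconds = step.total_seconds()

    def snap(ts):
        # Round a timestamp to the nearest grid point.
        seconds = (ts - EPOCH).total_seconds()
        return EPOCH + timedelta(seconds=round(seconds / step_seconds) * step_seconds)

    snapped = {snap(ts): value for ts, value in sorted(readings, key=lambda r: r[0])}
    grid = []
    t = min(snapped)
    while t <= max(snapped):
        grid.append(t)
        t += step
    values = [snapped.get(t) for t in grid]

    i = 0
    while i < len(values):
        if values[i] is None:
            j = i
            while values[j] is None:  # the last grid point always holds a reading
                j += 1
            if j - i <= max_gap_steps:
                left, right = values[i - 1], values[j]
                for k in range(i, j):
                    values[k] = left + (right - left) * (k - i + 1) / (j - i + 1)
            i = j
        else:
            i += 1
    return list(zip(grid, values))

# Hypothetical temperature readings with jittered timestamps and two gaps.
base = datetime(2024, 10, 14, 0, 0, 0)
raw = [
    (base + timedelta(seconds=2), 20.0),
    (base + timedelta(minutes=10, seconds=1), 22.0),
    (base + timedelta(minutes=30), 30.0),
]
clean = regularize(raw)
clean_values = [value for _, value in clean]
```

The single missing sample at 00:05 is filled with 21.0, while the three-sample gap before 00:30 is left empty, since interpolating across long outages would invent structure the sensor never recorded.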
From Noise to Knowledge: A Comparative Study of Acoustic Anomaly Detection Models in Pumped-storage Hydropower Plants.
EN: In the context of industrial factories and energy producers, unplanned outages are highly costly and difficult to service. However, existing acoustic-anomaly detection studies largely rely on generic industrial or synthetic datasets, with few focused on hydropower plants due to limited access. This paper presents a comparative analysis of acoustic-based anomaly detection methods, as a way to improve predictive maintenance in hydropower plants. We address key challenges in the acoustic preprocessing under highly noisy conditions before extracting time- and frequency-domain features. Then, we benchmark three machine learning models: LSTM AE, K-Means, and OC-SVM, which are tested on two real-world datasets from the Rodundwerk II pumped-storage plant in Austria, one with induced anomalies and one with real-world conditions. The One-Class SVM achieved the best trade-off of accuracy (ROC AUC 0.966-0.998) and minimal training time, while the LSTM autoencoder delivered strong detection (ROC AUC 0.889-0.997) at the expense of higher computational cost.
VI: Trong bối cảnh các nhà máy công nghiệp và nhà sản xuất năng lượng, các sự cố ngừng hoạt động không lường trước gây tốn kém và khó khắc phục. Tuy nhiên, các nghiên cứu hiện có về phát hiện bất thường âm thanh chủ yếu dựa vào các bộ dữ liệu công nghiệp hoặc tổng hợp chung, với ít nghiên cứu tập trung vào các nhà máy thủy điện do hạn chế tiếp cận. Bài báo này trình bày một phân tích so sánh các phương pháp phát hiện bất thường dựa trên âm thanh, như một cách để cải thiện bảo trì dự đoán trong các nhà máy thủy điện. Chúng tôi giải quyết các thách thức chính trong quá trình tiền xử lý âm thanh trong điều kiện ồn ào cao trước khi trích xuất các đặc trưng miền thời gian và tần số. Sau đó, chúng tôi đánh giá ba mô hình học máy: LSTM AE, K-Means và OC-SVM, được thử nghiệm trên hai bộ dữ liệu thực tế từ nhà máy tích năng bơm Rodundwerk II ở Áo, một bộ có các bất thường được tạo ra và một bộ có điều kiện thực tế. One-Class SVM đạt được sự cân bằng tốt nhất giữa độ chính xác (ROC AUC 0.966-0.998) và thời gian huấn luyện tối thiểu, trong khi bộ tự mã hóa LSTM mang lại khả năng phát hiện mạnh mẽ (ROC AUC 0.889-0.997) với chi phí tính toán cao hơn.
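The ROC AUC figures quoted above can be read as the probability that a randomly chosen anomalous recording receives a higher anomaly score than a randomly chosen normal one. The sketch below computes that quantity directly from pairs; the function name and the toy scores are hypothetical and unrelated to the Rodundwerk II data.

```python
def roc_auc(scores, labels):
    """Pairwise ROC AUC. labels: 1 = anomaly, 0 = normal; higher score = more anomalous."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # anomaly correctly ranked above normal
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy anomaly scores from a hypothetical detector.
toy_scores = [0.10, 0.20, 0.80, 0.30, 0.90, 0.25]
toy_labels = [0, 0, 1, 0, 1, 1]
toy_auc = roc_auc(toy_scores, toy_labels)
```

Here eight of the nine anomaly/normal pairs are ranked correctly, giving an AUC of 8/9. The pairwise loop is quadratic, so a rank-based implementation would be preferable for large evaluation sets.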
PhenoMoler: Phenotype-Guided Molecular Optimization via Chemistry Large Language Model.
EN: Current molecular generative models primarily focus on improving drug-target binding affinity and specificity, often neglecting the system-level phenotypic effects elicited by compounds. Transcriptional profiles, as molecule-level readouts of drug-induced phenotypic shifts, offer a powerful opportunity to guide molecular design in a phenotype-aware manner. We present PhenoMoler, a phenotype-guided molecular generation framework that integrates a chemistry large language model with expression profiles to enable biologically informed drug design. By conditioning the generation on drug-induced differential expression signatures, PhenoMoler explicitly links transcriptional responses to chemical structure. By selectively masking and reconstructing specific substructures (scaffolds, side chains, or linkers), PhenoMoler supports fine-grained, controllable molecular optimization. Extensive experiments demonstrate that PhenoMoler generates chemically valid, novel, and diverse molecules aligned with desired phenotypic profiles. Compared to FDA-approved drugs, the generated compounds exhibit comparable or enhanced drug-likeness (QED), optimized physicochemical properties, and superior binding af...
VI: Các mô hình tạo sinh phân tử hiện tại chủ yếu tập trung vào việc cải thiện ái lực và độ đặc hiệu liên kết thuốc-mục tiêu, thường bỏ qua các tác động kiểu hình cấp hệ thống do các hợp chất gây ra. Hồ sơ phiên mã, như là các chỉ số cấp phân tử về sự thay đổi kiểu hình do thuốc gây ra, mang đến một cơ hội mạnh mẽ để hướng dẫn thiết kế phân tử theo cách nhận biết kiểu hình. Chúng tôi giới thiệu PhenoMoler, một khung tạo sinh phân tử được hướng dẫn bởi kiểu hình, tích hợp một mô hình ngôn ngữ lớn hóa học với hồ sơ biểu hiện để cho phép thiết kế thuốc dựa trên thông tin sinh học. Bằng cách điều kiện hóa việc tạo sinh trên các dấu hiệu biểu hiện khác biệt do thuốc gây ra, PhenoMoler liên kết một cách rõ ràng các phản ứng phiên mã với cấu trúc hóa học. Bằng cách che chắn và tái tạo có chọn lọc các cấu trúc con cụ thể - khung sườn, chuỗi bên hoặc liên kết - PhenoMoler hỗ trợ tối ưu hóa phân tử có thể kiểm soát, chi tiết. Các thử nghiệm mở rộng chứng minh rằng PhenoMoler tạo ra các phân tử hợp lệ về mặt hóa học, mới lạ và đa dạng phù hợp với các hồ sơ kiểu hình mong muốn. So với các loại thuốc đã được FDA phê duyệt, các hợp chất được tạo ra thể hiện tính chất giống thuốc tương đương hoặc nâng cao (QED), các đặc tính lý hóa được tối ưu hóa và ái lực liên kết vượt trội với các mục tiêu ung thư quan trọng. Những phát hiện này làm nổi bật tiềm năng của PhenoMoler trong việc tối ưu hóa phân tử được hướng dẫn bởi kiểu hình và có thể kiểm soát cấu trúc.
GRPO++: Enhancing Dermatological Reasoning under Low Resource Settings.
EN: Vision-Language Models (VLMs) show promise in medical image analysis, yet their capacity for structured reasoning in complex domains like dermatology is often limited by data scarcity and the high computational cost of advanced training techniques. To address these challenges, we introduce DermIQ-VLM, a VLM developed through a multi-stage, resource-efficient methodology designed to emulate a dermatologist's diagnostic process. Our primary contribution is a modified version of Group Relative Policy Optimization (GRPO), called GRPO++, which stabilizes the powerful but data-intensive GRPO framework. Our proposed training pipeline first employs GRPO++ for reasoning-oriented disease recognition, followed by supervised fine-tuning for conversational ability. To mitigate factual errors introduced during this step, we then align the model using Direct Preference Optimization (DPO), leveraging a Knowledge Graph-based system as a scalable proxy for expert preference. A preliminary evaluation on a curated dermatological dataset demonstrates that our proposed methodology yields notable performance gains over standard fine-tuning approaches. These findings validate the potential of our pipeli...
VI: Các Mô hình Ngôn ngữ-Thị giác (VLMs) cho thấy tiềm năng trong phân tích hình ảnh y tế, tuy nhiên khả năng lý luận có cấu trúc của chúng trong các lĩnh vực phức tạp như da liễu thường bị hạn chế bởi sự khan hiếm dữ liệu và chi phí tính toán cao của các kỹ thuật huấn luyện tiên tiến. Để giải quyết những thách thức này, chúng tôi giới thiệu DermIQ-VLM, một VLM được phát triển thông qua phương pháp đa giai đoạn, tiết kiệm tài nguyên được thiết kế để mô phỏng quy trình chẩn đoán của bác sĩ da liễu. Đóng góp chính của chúng tôi là một phiên bản sửa đổi của Tối ưu hóa Chính sách Tương đối Nhóm (GRPO), được gọi là GRPO++, giúp ổn định khung GRPO mạnh mẽ nhưng sử dụng nhiều dữ liệu. Quy trình huấn luyện được đề xuất của chúng tôi ban đầu sử dụng GRPO++ để nhận diện bệnh theo hướng lý luận, sau đó là tinh chỉnh có giám sát cho khả năng đàm thoại. Để giảm thiểu các lỗi thực tế phát sinh trong bước này, chúng tôi sau đó căn chỉnh mô hình bằng Tối ưu hóa Ưu tiên Trực tiếp (DPO), tận dụng một hệ thống dựa trên Đồ thị Tri thức như một proxy có thể mở rộng cho sở thích của chuyên gia. Đánh giá sơ bộ trên một bộ dữ liệu da liễu được tuyển chọn cho thấy phương pháp được đề xuất của chúng tôi mang lại những cải thiện hiệu suất đáng kể so với các phương pháp tinh chỉnh tiêu chuẩn. Những phát hiện này xác nhận tiềm năng của quy trình của chúng tôi như một con đường khả thi để phát triển các VLM chuyên biệt, đáng tin cậy trong môi trường hạn chế tài nguyên.
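For readers unfamiliar with the baseline that GRPO++ modifies, here is a minimal sketch of the group-relative advantage used in standard GRPO as it is commonly described: rewards for several responses sampled for the same prompt are normalized by the group mean and standard deviation. It does not reproduce the GRPO++ changes, which the abstract does not specify. The function name, the `eps` constant, the use of the population standard deviation, and the sample rewards are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Standard GRPO-style advantage: (r_i - mean(r)) / (std(r) + eps).
    `eps` is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above the group average receive a positive advantage and those below receive a negative one, so no separate value network is needed to form a baseline.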
Generative data augmentation for biliary tract detection on intraoperative images.
EN: Cholecystectomy is one of the most frequently performed procedures in gastrointestinal surgery, and the laparoscopic approach is the gold standard for symptomatic cholecystolithiasis and acute cholecystitis. Although it offers significantly faster recovery and better cosmetic results, the laparoscopic approach carries a higher risk of bile duct injury, which has a significant impact on quality of life and survival. To avoid bile duct injury, it is essential to improve the intraoperative visualization of the bile duct. This work addresses the problem with a deep-learning approach that localizes the biliary tract in white-light images acquired during surgery. To this end, an image database was constructed and annotated to train the YOLO detection algorithm. Besides classical data augmentation techniques, the paper proposes a Generative Adversarial Network (GAN) to generate a synthetic portion of the training dataset. Experimental results are discussed along with ethical considerations.
VI: Cắt túi mật là một trong những thủ thuật được thực hiện thường xuyên nhất trong phẫu thuật tiêu hóa, và phương pháp nội soi là tiêu chuẩn vàng cho sỏi mật có triệu chứng và viêm túi mật cấp tính. Bên cạnh những ưu điểm như phục hồi nhanh hơn đáng kể và kết quả thẩm mỹ tốt hơn, phương pháp nội soi mang lại nguy cơ tổn thương đường mật cao hơn, điều này có tác động đáng kể đến chất lượng cuộc sống và tỷ lệ sống sót. Để tránh tổn thương đường mật, điều cần thiết là cải thiện khả năng quan sát đường mật trong khi phẫu thuật. Nghiên cứu này nhằm mục đích giải quyết vấn đề này bằng cách tận dụng phương pháp học sâu để định vị đường mật từ hình ảnh ánh sáng trắng thu được trong quá trình phẫu thuật. Để đạt được mục tiêu này, việc xây dựng và chú thích cơ sở dữ liệu hình ảnh để huấn luyện thuật toán phát hiện Yolo đã được sử dụng. Bên cạnh các kỹ thuật tăng cường dữ liệu cổ điển, bài báo đề xuất Mạng đối nghịch sinh (GAN) để tạo ra một phần tổng hợp của bộ dữ liệu huấn luyện. Kết quả thử nghiệm đã được thảo luận cùng với các cân nhắc về mặt đạo đức.
Deep Clustering for Blood Cell Classification and Quantification.
EN: Accurate classification of blood cells plays a key role in improving automated blood analysis for both medical and veterinary applications. This work presents a two-stage deep clustering method for classifying blood cells from high-dimensional signal data. In the first stage, red blood cells (RBCs) and platelets (PLTs) are separated using a combination of an improved autoencoder and the IDEC algorithm. The second stage further classifies RBC subtypes (pure RBCs, reticulocytes, and clumped RBCs) through a variational deep embedding (VaDE) approach. Due to the lack of detailed cell-level labels, soft classification probabilities are generated from sample-level data to approximate the true distributions. The aim is to contribute to the development of low-cost, automated blood analysis systems suitable for veterinary and biomedical use. Initial results indicate that the method shows promise in distinguishing different blood cell populations, even with limited supervision.
VI: Việc phân loại chính xác tế bào máu đóng vai trò then chốt trong việc cải thiện phân tích máu tự động cho cả ứng dụng y tế và thú y. Nghiên cứu này trình bày phương pháp phân cụm sâu hai giai đoạn để phân loại tế bào máu từ dữ liệu tín hiệu đa chiều. Trong giai đoạn đầu tiên, tế bào hồng cầu (RBCs) và tiểu cầu (PLTs) được tách ra bằng cách sử dụng kết hợp giữa một autoencoder cải tiến và thuật toán IDEC. Giai đoạn thứ hai tiếp tục phân loại sâu hơn các loại phụ của tế bào hồng cầu, hồng cầu thuần, hồng cầu lưới và hồng cầu vón cục, thông qua phương pháp nhúng sâu biến phân (VaDE). Do thiếu nhãn chi tiết ở cấp độ tế bào, xác suất phân loại mềm được tạo ra từ dữ liệu cấp độ mẫu để xấp xỉ các phân phối thực tế. Mục tiêu là đóng góp vào sự phát triển của các hệ thống phân tích máu tự động, chi phí thấp, phù hợp cho sử dụng trong thú y và y sinh. Kết quả ban đầu cho thấy phương pháp này hứa hẹn trong việc phân biệt hiệu quả các quần thể tế bào máu khác nhau, ngay cả khi có sự giám sát hạn chế.
Interpretable Clinical Classification with Kolmogorov-Arnold Networks.
EN: Why should a clinician trust an Artificial Intelligence (AI) prediction? Despite the increasing accuracy of machine learning methods in medicine, the lack of transparency continues to hinder their adoption in clinical practice. In this work, we explore Kolmogorov-Arnold Networks (KANs) for clinical classification tasks on tabular data. Unlike traditional neural networks, KANs are function-based architectures that offer intrinsic interpretability through transparent, symbolic representations. We introduce Logistic-KAN, a flexible generalization of logistic regression, and Kolmogorov-Arnold Additive Model (KAAM), a simplified additive variant that delivers transparent, symbolic formulas. Unlike black-box models that require post-hoc explainability tools, our models support built-in patient-level insights, intuitive visualizations, and nearest-patient retrieval. Across multiple health datasets, our models match or outperform standard baselines, while remaining fully interpretable. These results position KANs as a promising step toward trustworthy AI that clinicians can understand, audit, and act upon.
VI: Tại sao một bác sĩ lâm sàng nên tin vào dự đoán của Trí tuệ Nhân tạo (AI)? Mặc dù độ chính xác của các phương pháp học máy trong y học ngày càng tăng, nhưng sự thiếu minh bạch vẫn tiếp tục cản trở việc áp dụng chúng trong thực hành lâm sàng. Trong công trình này, chúng tôi khám phá Mạng Kolmogorov-Arnold (KAN) cho các tác vụ phân loại lâm sàng trên dữ liệu dạng bảng. Không giống như các mạng nơ-ron truyền thống, KAN là các kiến trúc dựa trên hàm, cung cấp khả năng diễn giải nội tại thông qua các biểu diễn tượng trưng, minh bạch. Chúng tôi giới thiệu Logistic-KAN, một khái quát hóa linh hoạt của hồi quy logistic và Mô hình Cộng tính Kolmogorov-Arnold (KAAM), một biến thể cộng tính đơn giản hóa, cung cấp các công thức tượng trưng, minh bạch. Không giống như các mô hình hộp đen cần các công cụ giải thích hậu nghiệm, các mô hình của chúng tôi hỗ trợ thông tin chi tiết cấp độ bệnh nhân tích hợp sẵn, trực quan hóa trực quan và truy xuất bệnh nhân gần nhất. Trên nhiều bộ dữ liệu sức khỏe, các mô hình của chúng tôi phù hợp hoặc vượt trội so với các đường cơ sở tiêu chuẩn, đồng thời vẫn hoàn toàn có thể diễn giải được. Những kết quả này định vị KAN như một bước tiến đầy hứa hẹn hướng tới AI đáng tin cậy mà các bác sĩ lâm sàng có thể hiểu, kiểm toán và hành động dựa trên đó.
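As a rough illustration of what an additive, logistic-style classifier looks like, the sketch below sums one univariate function per feature and passes the total through a sigmoid, returning the per-feature contributions alongside the probability. The feature names, shape functions, and bias are invented for illustration and are not taken from the paper; in a KAN the univariate functions would be learned rather than hand-written.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical per-feature shape functions (hand-written stand-ins for
# the learned univariate functions of an additive model).
SHAPE_FUNCTIONS = {
    "age": lambda x: 0.03 * (x - 50.0),
    "systolic_bp": lambda x: 0.02 * max(0.0, x - 120.0),
}
BIAS = -1.0  # assumed intercept


def predict_with_contributions(patient):
    """Return (probability, per-feature contributions to the logit)."""
    contributions = {
        name: f(patient[name]) for name, f in SHAPE_FUNCTIONS.items()
    }
    logit = BIAS + sum(contributions.values())
    return sigmoid(logit), contributions


prob, parts = predict_with_contributions({"age": 70.0, "systolic_bp": 150.0})
```

Because the logit is a plain sum, each entry in `parts` can be read directly as that feature's push toward or away from the positive class, which is the kind of patient-level insight an additive structure makes possible.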
Non-reciprocal coalescence-breakup dynamics in concentrated emulsions.
EN: Dense stabilized emulsions are mixtures of immiscible fluids in which the dispersed droplet phase, present at a high volume fraction, is stabilized against coalescence by steric interactions. The production of emulsions involves high-shear flows, and it is well known that at a critical volume fraction the emulsion loses stability, undergoing an extremely rapid process in which the fluid components exchange roles. This process, called catastrophic phase inversion, resembles a dynamical phase transition in several respects and has remained largely elusive from both an experimental and a theoretical point of view. In this work, we present state-of-the-art experimental and numerical data to support a dynamical-system framework capable of precisely highlighting the dynamics occurring in the system as it approaches the catastrophic phase inversion. Our study clearly highlights that at high volume fractions, dynamical changes in the emulsion morphology, due to coalescence and breakup of droplets, play a critical role in determining the emulsion's rheology and stability. Additionally, we show that on approaching the critical volume fraction, the dynamics can be simplified as being controlled by the dyna...
VI: Các nhũ tương đặc được ổn định là hỗn hợp của các chất lỏng không trộn lẫn, trong đó pha phân tán dạng giọt có tỷ lệ thể tích lớn được ổn định chống lại sự hợp nhất bởi các tương tác lập thể. Quá trình sản xuất nhũ tương liên quan đến dòng chảy có độ cắt cao và người ta biết rõ rằng ở một tỷ lệ thể tích tới hạn, nhũ tương mất ổn định, trải qua một quá trình cực kỳ nhanh chóng, trong đó các thành phần chất lỏng trong nhũ tương đổi vai trò cho nhau. Quá trình này, được gọi là sự đảo pha thảm khốc, có nhiều điểm tương đồng với sự chuyển pha động lực học, vẫn còn khó nắm bắt từ quan điểm thực nghiệm và lý thuyết. Trong công trình này, chúng tôi trình bày dữ liệu thực nghiệm và số học hiện đại để hỗ trợ một khuôn khổ hệ động lực có khả năng làm nổi bật chính xác động lực học xảy ra trong hệ thống khi nó tiếp cận sự đảo pha thảm khốc. Nghiên cứu của chúng tôi cho thấy rõ ràng rằng ở tỷ lệ thể tích cao, những thay đổi động lực học trong hình thái nhũ tương, do sự hợp nhất và vỡ của các giọt, đóng một vai trò quan trọng trong việc xác định tính lưu biến và độ ổn định của nhũ tương. Ngoài ra, chúng tôi chỉ ra rằng khi tiếp cận tỷ lệ thể tích tới hạn, động lực học có thể được đơn giản hóa như thể được kiểm soát bởi động lực học của một độ dài tương quan được biểu diễn, trong hệ thống của chúng tôi, bằng kích thước của giọt lớn nhất. Động lực học này có mối liên hệ chặt chẽ với sự chuyển pha không tương hỗ, trong đó hai cơ chế vật lý khác nhau, sự hợp nhất và vỡ, có thể mất cân bằng dẫn đến các biến đổi tuần hoàn không đối xứng lớn trong không gian pha. Chúng tôi làm rõ hiện tượng quan sát được và giải thích định lượng khía cạnh thiết yếu của động lực học cực kỳ phức tạp của các nhũ tương được ổn định trải qua sự đảo pha thảm khốc.
Evolution of surfactant-free 'pristine' emulsions.
EN: The term pristine interface was introduced by Beattie and Djerdjev 20 years ago for emulsions that consist of only water and oil with no surfactant. They are different from Pickering emulsions, which are also surfactant-free but stabilized with colloidal particles. In contrast to previous studies, we monitor the kinetics of the initial stages of emulsion formation. We conducted such tests in an open setup, where samples are open to air and the CO2 content in the water varies, and in a closed setup, where samples are isolated with fixed CO2 content. For the open setup, sonication and an initial pH > 9 lead to emulsions with high zeta potential and sub-micron droplet size. There are two evolution patterns: short-term and long-term. The short term lasts about 1 day and has changing pH and zeta potential, but almost constant droplet size. The long term lasts several days or even weeks, with droplet size increasing toward a saturation value (at a rate dependent on mixing conditions) while pH and zeta potential remain constant. Emulsification in the closed setup is much less pronounced and pH remains constant. This difference points to the importance of adsorbed CO2 and related carbonate ions in the for...
VI: Thuật ngữ giao diện nguyên sơ được Beattie và Djerdjev giới thiệu cách đây 20 năm cho các nhũ tương chỉ bao gồm nước và dầu, không có chất hoạt động bề mặt. Chúng khác với nhũ tương Pickering, cũng không chứa chất hoạt động bề mặt nhưng được ổn định bằng các hạt keo. Trái ngược với các nghiên cứu trước đây, chúng tôi theo dõi động học của các giai đoạn ban đầu của quá trình hình thành nhũ tương. Chúng tôi đã tiến hành các thử nghiệm như vậy trong thiết lập mở khi các mẫu tiếp xúc với không khí và hàm lượng CO2 trong nước thay đổi, và trong thiết lập kín khi các mẫu được cô lập với hàm lượng CO2 cố định. Đối với thiết lập mở, siêu âm và pH ban đầu > 9 dẫn đến nhũ tương có điện thế zeta cao và kích thước giọt ở mức dưới micron. Có hai dạng tiến triển: ngắn hạn và dài hạn. Ngắn hạn kéo dài khoảng 1 ngày và có pH và điện thế zeta thay đổi, nhưng kích thước giọt gần như không đổi. Dài hạn kéo dài vài ngày hoặc thậm chí vài tuần, với kích thước giọt tăng dần đến giá trị bão hòa (tốc độ phụ thuộc vào điều kiện trộn), với pH và điện thế zeta duy trì không đổi. Quá trình nhũ hóa ở thiết lập kín ít rõ rệt hơn nhiều và pH vẫn không đổi. Sự khác biệt này chỉ ra tầm quan trọng của CO2 hấp thụ và các ion cacbonat liên quan trong quá trình hình thành nhũ tương nguyên sơ và tích điện cho giao diện giọt. Chúng tôi đưa ra giả thuyết về sự tồn tại của lớp phân tử nước có cấu trúc tại giao diện, theo Eastoe và Ellis. Lớp điện kép tác dụng một lực (điện môi) lên các mômen lưỡng cực nước trong lớp này, lực này bù lại áp suất Kelvin. Kích thước giọt từ mô hình này gần với các phép đo của chúng tôi. Ngoài ra, có một lực đẩy các mômen lưỡng cực nước, lực này bù cho sức căng bề mặt song song với giao diện. Sau khi loại trừ các giả thuyết thay thế bằng dữ liệu của chúng tôi, chúng tôi kết luận rằng mô hình được đề xuất để giải thích sự ổn định của nano-bọt cũng phù hợp với kết quả của chúng tôi cho các nhũ tương nguyên sơ này.
Cosmic dust as a prerequisite for the formation of complex organic molecules in space?
EN: In cold, dense astrophysical environments dust grains are mixed with molecular ices. Chemistry in those dust/ice mixtures is determined by diffusion and reaction of molecules and radicals. However, investigations of diffusion of astrophysically relevant radicals and molecules across the surface and through the pores of cosmic dust grains and of surface reactions consequent to such diffusion is largely uncharted territory. This paper presents results of a study of a solid-state reaction of two molecular species, CO2 and NH3, separated by a layer of porous silicate grain aggregates, analogues of cosmic dust. The experiments demonstrate that the presence of the dust layer was necessary for a pure thermal CO2 + 2NH3 reaction to proceed, leading to the formation of ammonium carbamate (NH4+NH2COO-), an ionic solid containing a complex organic moiety of prebiotic interest recently detected in a protoplanetary disk. This result speaks for: (i) efficient diffusion of molecules on/within cosmic dust, (ii) an underestimated role for surface catalysis in the astrochemistry of cosmic dust, and (iii) potentially efficient dust-promoted chemistry in warm cosmic environments, such as protostellar ...
VI: Trong môi trường vật lý thiên văn lạnh và đậm đặc, các hạt bụi trộn lẫn với băng phân tử. Phản ứng hóa học trong các hỗn hợp bụi/băng này được quyết định bởi sự khuếch tán và phản ứng của các phân tử và gốc tự do. Tuy nhiên, các nghiên cứu về sự khuếch tán của các gốc tự do và phân tử có liên quan đến vật lý thiên văn trên bề mặt và xuyên qua các lỗ rỗng của các hạt bụi vũ trụ, cũng như các phản ứng bề mặt do sự khuếch tán đó gây ra, phần lớn vẫn là một lãnh địa chưa được khám phá. Bài báo này trình bày kết quả của một nghiên cứu về phản ứng pha rắn của hai loại phân tử, CO2 và NH3, được phân tách bởi một lớp vật liệu kết tụ hạt silicat xốp, tương tự như bụi vũ trụ. Các thí nghiệm chứng minh rằng sự hiện diện của lớp bụi là cần thiết để phản ứng CO2 + 2NH3 nhiệt thuần túy diễn ra, dẫn đến sự hình thành amoni carbamate (NH4+NH2COO-), một chất rắn ion chứa một phần hữu cơ phức tạp có ý nghĩa tiền sinh học gần đây đã được phát hiện trong một đĩa tiền hành tinh. Kết quả này cho thấy: (i) sự khuếch tán hiệu quả của các phân tử trên/trong bụi vũ trụ, (ii) vai trò xúc tác bề mặt bị đánh giá thấp trong hóa học thiên văn của bụi vũ trụ và (iii) hóa học do bụi thúc đẩy có khả năng hiệu quả trong môi trường vũ trụ ấm áp, chẳng hạn như vỏ bọc tiền sao và đĩa tiền hành tinh.
Synthetic Protein-Ligand Complex Generation for Deep Molecular Docking.
EN: The scarcity of experimental protein-ligand complexes poses a significant challenge for training robust deep learning models for molecular docking. Given the prohibitive cost and time constraints associated with experimental structure determination, scalable generation of realistic protein-ligand complexes is needed to expand available datasets for model development. In this study, we introduce a novel workflow for the procedural generation and validation of synthetic protein-ligand complexes, combining a diverse ensemble of generation techniques and rigorous quality control. We assessed the utility of these synthetic datasets by retraining established docking models, Smina and Gnina, and evaluating their performance on standard benchmarks including the PDBBind core set and the PoseBusters dataset. Our results demonstrate that models trained on synthetic data achieve performance comparable to models trained on experimental data, indicating that current synthetic complexes can effectively capture many salient features of protein-ligand interactions. However, we did not observe significant improvements in docking or scoring accuracy over conventional methods or experimental data augm...
VI: Sự khan hiếm các phức protein-ligand thực nghiệm gây ra một thách thức đáng kể cho việc huấn luyện các mô hình học sâu mạnh mẽ cho docking phân tử. Với chi phí tốn kém và hạn chế về thời gian liên quan đến việc xác định cấu trúc thực nghiệm, cần có khả năng tạo ra các phức protein-ligand thực tế có thể mở rộng để mở rộng các tập dữ liệu có sẵn cho phát triển mô hình. Trong nghiên cứu này, chúng tôi giới thiệu một quy trình mới để tạo và xác thực các phức protein-ligand tổng hợp theo quy trình, kết hợp một tập hợp đa dạng các kỹ thuật tạo và kiểm soát chất lượng nghiêm ngặt. Chúng tôi đã đánh giá tính hữu ích của các tập dữ liệu tổng hợp này bằng cách đào tạo lại các mô hình docking đã được thiết lập, Smina và Gnina, và đánh giá hiệu suất của chúng trên các tiêu chuẩn chuẩn bao gồm bộ lõi PDBBind và tập dữ liệu PoseBusters. Kết quả của chúng tôi chứng minh rằng các mô hình được huấn luyện trên dữ liệu tổng hợp đạt được hiệu suất tương đương với các mô hình được huấn luyện trên dữ liệu thực nghiệm, cho thấy rằng các phức tổng hợp hiện tại có thể nắm bắt hiệu quả nhiều đặc điểm nổi bật của tương tác protein-ligand. Tuy nhiên, chúng tôi không quan sát thấy sự cải thiện đáng kể nào về độ chính xác của việc docking hoặc chấm điểm so với các phương pháp thông thường hoặc tăng cường dữ liệu thực nghiệm. Những phát hiện này làm nổi bật những hứa hẹn cũng như những hạn chế hiện tại của dữ liệu tổng hợp đối với docking phân tử dựa trên học sâu và nhấn mạnh sự cần thiết phải tinh chỉnh hơn nữa trong phương pháp tạo và chiến lược đánh giá để khai thác đầy đủ tiềm năng của các tập dữ liệu tổng hợp cho ứng dụng này.
Early Detection of Branched Broomrape (Phelipanche ramosa) Infestation in Tomato Crops Using Leaf Spectral Analysis and Machine Learning.
EN: Branched broomrape (Phelipanche ramosa) is a chlorophyll-deficient parasitic weed that threatens tomato production by extracting nutrients from the host. We investigate early detection using leaf-level spectral reflectance (400-2500 nm) and ensemble machine learning. In a field experiment in Woodland, California, we tracked 300 tomato plants across growth stages defined by growing degree days (GDD). Leaf reflectance was acquired with a portable spectrometer and preprocessed (band denoising, 1 nm interpolation, Savitzky-Golay smoothing, correlation-based band reduction). Clear class differences were observed near 1500 nm and 2000 nm water absorption features, consistent with reduced leaf water content in infected plants at early stages. An ensemble combining Random Forest, XGBoost, SVM with RBF kernel, and Naive Bayes achieved 89% accuracy at 585 GDD, with recalls of 0.86 (infected) and 0.93 (noninfected). Accuracy declined at later stages (e.g., 69% at 1568 GDD), likely due to senescence and weed interference. Despite the small number of infected plants and environmental confounders, results show that proximal sensing with ensemble learning enables timely detection of broomrape bef...
VI: Cỏ chổi phân nhánh (Phelipanche ramosa) là một loài cỏ dại ký sinh thiếu chất diệp lục, đe dọa sản xuất cà chua bằng cách hút chất dinh dưỡng từ cây chủ. Chúng tôi nghiên cứu phát hiện sớm bằng cách sử dụng phản xạ quang phổ cấp độ lá (400-2500 nm) và học máy tổ hợp (ensemble). Trong một thí nghiệm thực địa ở Woodland, California, chúng tôi theo dõi 300 cây cà chua qua các giai đoạn sinh trưởng được xác định bởi ngày độ sinh trưởng (GDD). Phản xạ lá được thu thập bằng máy quang phổ cầm tay và tiền xử lý (khử nhiễu băng tần, nội suy 1 nm, làm mịn Savitzky-Golay, giảm băng tần dựa trên tương quan). Sự khác biệt rõ ràng giữa các lớp được quan sát thấy gần các đặc điểm hấp thụ nước 1500 nm và 2000 nm, phù hợp với hàm lượng nước trong lá giảm ở cây bị nhiễm bệnh ở giai đoạn đầu. Một mô hình tổ hợp gồm Random Forest, XGBoost, SVM với nhân RBF và Naive Bayes đạt độ chính xác 89% ở 585 GDD, với độ recall lần lượt là 0,86 (bị nhiễm bệnh) và 0,93 (không bị nhiễm bệnh). Độ chính xác giảm ở các giai đoạn sau (ví dụ: 69% ở 1568 GDD), có thể là do sự lão hóa và sự can thiệp của cỏ dại. Mặc dù số lượng cây bị nhiễm bệnh ít và các yếu tố gây nhiễu môi trường, kết quả cho thấy rằng cảm biến gần kết hợp với học máy tổ hợp cho phép phát hiện kịp thời cỏ chổi trước khi các triệu chứng tán lá có thể nhìn thấy, hỗ trợ các biện pháp can thiệp có mục tiêu và giảm tổn thất năng suất.
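The growth stages in this entry are indexed by growing degree days (GDD). A minimal sketch of one common way to accumulate GDD from daily minimum and maximum temperatures follows. The base temperature of 10 °C and the sample temperatures are assumptions for illustration; the abstract does not state which GDD formula or base temperature the study used.

```python
def daily_gdd(t_min, t_max, t_base=10.0):
    """Simple-average GDD for one day, floored at zero.

    `t_base` (assumed 10 degrees C here) is the temperature below which
    no development is counted.
    """
    return max(0.0, (t_min + t_max) / 2.0 - t_base)


def cumulative_gdd(daily_temps, t_base=10.0):
    """Running GDD total over a sequence of (t_min, t_max) pairs."""
    total = 0.0
    running = []
    for t_min, t_max in daily_temps:
        total += daily_gdd(t_min, t_max, t_base)
        running.append(total)
    return running


# Hypothetical daily (min, max) temperatures in degrees C.
temps = [(12.0, 28.0), (8.0, 10.0), (15.0, 31.0)]
running_gdd = cumulative_gdd(temps)
```

Indexing measurements by accumulated heat rather than calendar date is what lets a stage such as 585 GDD be compared across plantings that experienced different weather.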
Mathematical Discovery of Potential Therapeutic Targets: Application to Rare Melanomas.
EN: Patients with rare types of melanoma such as acral, mucosal, or uveal melanoma, have lower survival rates than patients with cutaneous melanoma; these lower survival rates reflect the lower objective response rates to immunotherapy compared to cutaneous melanoma. Understanding tumor-immune dynamics in rare melanomas is critical for the development of new therapies and for improving response rates to current cancer therapies. Progress has been hindered by the lack of clinical data and the need for better preclinical models of rare melanomas. Canine melanoma provides a valuable comparative oncology model for rare types of human melanomas. We analyzed RNA sequencing data from canine melanoma patients and combined this with literature information to create a novel mechanistic mathematical model of melanoma-immune dynamics. Sensitivity analysis of the mathematical model indicated influential pathways in the dynamics, providing support for potential new therapeutic targets and future combinations of therapies. We share our learnings from this work, to help enable the application of this proof-of-concept workflow to other rare disease settings with sparse available data.
VI: Bệnh nhân mắc các loại u hắc tố hiếm gặp như u hắc tố đầu chi, niêm mạc hoặc màng bồ đào, có tỷ lệ sống sót thấp hơn so với bệnh nhân mắc u hắc tố da; tỷ lệ sống sót thấp hơn này phản ánh tỷ lệ đáp ứng khách quan thấp hơn với liệu pháp miễn dịch so với u hắc tố da. Hiểu rõ động lực học khối u-miễn dịch trong các loại u hắc tố hiếm gặp là rất quan trọng để phát triển các liệu pháp mới và cải thiện tỷ lệ đáp ứng với các liệu pháp điều trị ung thư hiện tại. Tiến độ đã bị cản trở do thiếu dữ liệu lâm sàng và cần có các mô hình tiền lâm sàng tốt hơn về các loại u hắc tố hiếm gặp. U hắc tố ở chó cung cấp một mô hình ung thư so sánh có giá trị cho các loại u hắc tố hiếm gặp ở người. Chúng tôi đã phân tích dữ liệu giải trình tự RNA từ bệnh nhân chó mắc u hắc tố và kết hợp điều này với thông tin từ các tài liệu để tạo ra một mô hình toán học cơ học mới về động lực học u hắc tố-miễn dịch. Phân tích độ nhạy của mô hình toán học cho thấy các con đường ảnh hưởng trong động lực học, hỗ trợ cho các mục tiêu điều trị tiềm năng mới và các kết hợp liệu pháp trong tương lai. Chúng tôi chia sẻ những bài học kinh nghiệm từ công việc này để giúp cho phép ứng dụng quy trình công việc chứng minh khái niệm này vào các bối cảnh bệnh hiếm gặp khác với dữ liệu có sẵn thưa thớt.
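To make the idea of sensitivity analysis concrete, here is a small sketch on a toy two-variable tumor-immune model. The equations, parameter names, initial conditions, and values are invented for illustration and are not the authors' model. The sketch only shows how perturbing one parameter at a time and re-simulating yields a normalized sensitivity score.

```python
def simulate_tumor(params, t_end=10.0, dt=0.01):
    """Forward-Euler integration of a toy model; returns final tumor size.

    dT/dt = growth * T * (1 - T / capacity) - kill * E * T
    dE/dt = recruit - decay * E
    T is tumor burden and E is an effector-cell level (both hypothetical).
    """
    tumor, effector = 0.1, 0.1  # assumed initial conditions
    for _ in range(int(round(t_end / dt))):
        d_tumor = (
            params["growth"] * tumor * (1.0 - tumor / params["capacity"])
            - params["kill"] * effector * tumor
        )
        d_effector = params["recruit"] - params["decay"] * effector
        tumor += dt * d_tumor
        effector += dt * d_effector
    return tumor


def relative_sensitivities(params, rel_step=0.01):
    """One-at-a-time sensitivity: (relative output change) / (relative input change)."""
    base = simulate_tumor(params)
    scores = {}
    for name, value in params.items():
        perturbed = dict(params)
        perturbed[name] = value * (1.0 + rel_step)
        scores[name] = ((simulate_tumor(perturbed) - base) / base) / rel_step
    return scores


# Hypothetical parameter values chosen only so the toy model runs.
PARAMS = {"growth": 0.5, "capacity": 1.0, "kill": 0.3, "recruit": 0.2, "decay": 0.1}
scores = relative_sensitivities(PARAMS)
ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
```

Parameters at the front of `ranked` are those whose small changes move the simulated tumor burden the most, which is the sense in which a sensitivity analysis can point to influential pathways. A local one-at-a-time scheme is used here for brevity; the abstract does not say which method the study applied.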
Improved Classification of Nitrogen Stress Severity in Plants Under Combined Stress Conditions Using Spatio-Temporal Deep Learning Framework.
EN: Plants in their natural habitats endure an array of interacting stresses, both biotic and abiotic, that rarely occur in isolation. Nutrient stress-particularly nitrogen deficiency-becomes even more critical when compounded with drought and weed competition, making it increasingly difficult to distinguish and address its effects. Early detection of nitrogen stress is therefore crucial for protecting plant health and implementing effective management strategies. This study proposes a novel deep learning framework to accurately classify nitrogen stress severity in a combined stress environment. Our model uses a unique blend of four imaging modalities-RGB, multispectral, and two infrared wavelengths-to capture a wide range of physiological plant responses from canopy images. These images, provided as time-series data, document plant health across three levels of nitrogen availability (low, medium, and high) under varying water stress and weed pressures. The core of our approach is a spatio-temporal deep learning pipeline that merges a Convolutional Neural Network (CNN) for extracting spatial features from images with a Long Short-Term Memory (LSTM) network to capture temporal dependenc...
VI: Trong môi trường sống tự nhiên, thực vật phải chịu đựng hàng loạt các tác động tương hỗ từ cả yếu tố sinh học và phi sinh học, những yếu tố này hiếm khi xảy ra riêng lẻ. Sự thiếu hụt dinh dưỡng - đặc biệt là thiếu nitơ - trở nên nghiêm trọng hơn khi kết hợp với hạn hán và sự cạnh tranh của cỏ dại, khiến việc phân biệt và giải quyết các tác động của nó ngày càng khó khăn. Do đó, việc phát hiện sớm tình trạng thiếu nitơ là rất quan trọng để bảo vệ sức khỏe thực vật và thực hiện các chiến lược quản lý hiệu quả. Nghiên cứu này đề xuất một khung học sâu mới để phân loại chính xác mức độ nghiêm trọng của tình trạng thiếu nitơ trong môi trường căng thẳng kết hợp. Mô hình của chúng tôi sử dụng sự kết hợp độc đáo của bốn phương thức hình ảnh - RGB, đa phổ và hai bước sóng hồng ngoại - để nắm bắt nhiều phản ứng sinh lý của thực vật từ hình ảnh tán cây. Những hình ảnh này, được cung cấp dưới dạng dữ liệu chuỗi thời gian, ghi lại sức khỏe thực vật ở ba mức độ sẵn có nitơ (thấp, trung bình và cao) dưới áp lực khác nhau về hạn hán và cỏ dại. Cốt lõi của phương pháp tiếp cận của chúng tôi là một quy trình học sâu không-thời gian, kết hợp Mạng nơ-ron tích chập (CNN) để trích xuất các đặc trưng không gian từ hình ảnh với Mạng bộ nhớ dài-ngắn hạn (LSTM) để nắm bắt các phụ thuộc thời gian. Chúng tôi cũng đã thiết kế và đánh giá một quy trình CNN chỉ dành cho không gian để so sánh. Quy trình CNN-LSTM của chúng tôi đạt được độ chính xác ấn tượng là 98%, vượt trội so với 80,45% của mô hình chỉ dành cho không gian và 76% của các phương pháp học máy đã được báo cáo trước đây. Những kết quả này mang lại những hiểu biết có giá trị dựa trên sức mạnh của phương pháp CNN-LSTM của chúng tôi trong việc nắm bắt hiệu quả các tương tác phức tạp và tinh tế giữa tình trạng thiếu nitơ, hạn hán và áp lực cỏ dại. Nền tảng mạnh mẽ này cung cấp một công cụ đầy hứa hẹn để xác định kịp thời và chủ động mức độ nghiêm trọng của tình trạng thiếu nitơ, cho phép quản lý cây trồng tốt hơn và cải thiện sức khỏe thực vật.
Towards scalable organ level 3D plant segmentation: Bridging the data algorithm computing gap.
EN: The precise characterization of plant morphology provides valuable insights into plant-environment interactions and genetic evolution. A key technology for extracting this information is 3D segmentation, which delineates individual plant organs from complex point clouds. Despite significant progress in general 3D computer vision domains, the adoption of 3D segmentation for plant phenotyping remains limited by three major challenges: i) the scarcity of large-scale annotated datasets, ii) technical difficulties in adapting advanced deep neural networks to plant point clouds, and iii) the lack of standardized benchmarks and evaluation protocols tailored to plant science. This review systematically addresses these barriers by: i) providing an overview of existing 3D plant datasets in the context of general 3D segmentation domains, ii) systematically summarizing deep learning-based methods for point cloud semantic and instance segmentation, iii) introducing Plant Segmentation Studio (PSS), an open-source framework for reproducible benchmarking, and iv) conducting extensive quantitative experiments to evaluate representative networks and sim-to-real learning strategies. Our findings high...
VI: Việc mô tả chính xác hình thái thực vật cung cấp những hiểu biết giá trị về tương tác giữa thực vật và môi trường, cũng như sự tiến hóa di truyền. Một công nghệ quan trọng để trích xuất thông tin này là phân đoạn 3D, giúp phân định các cơ quan riêng lẻ của cây từ các đám mây điểm phức tạp. Mặc dù đã có những tiến bộ đáng kể trong các lĩnh vực thị giác máy tính 3D nói chung, việc áp dụng phân đoạn 3D cho kiểu hình thực vật vẫn còn hạn chế bởi ba thách thức chính: i) sự khan hiếm của các tập dữ liệu được chú thích quy mô lớn, ii) những khó khăn kỹ thuật trong việc điều chỉnh các mạng nơ-ron sâu tiên tiến cho các đám mây điểm thực vật và iii) sự thiếu hụt các chuẩn mực và giao thức đánh giá tiêu chuẩn được điều chỉnh cho khoa học thực vật. Bài đánh giá này giải quyết một cách có hệ thống những rào cản này bằng cách: i) cung cấp tổng quan về các tập dữ liệu thực vật 3D hiện có trong bối cảnh các lĩnh vực phân đoạn 3D nói chung, ii) tóm tắt một cách có hệ thống các phương pháp dựa trên học sâu để phân đoạn ngữ nghĩa và phân đoạn đối tượng đám mây điểm, iii) giới thiệu Plant Segmentation Studio (PSS), một khung mã nguồn mở để đánh giá điểm chuẩn có thể tái tạo và iv) thực hiện các thử nghiệm định lượng sâu rộng để đánh giá các mạng đại diện và các chiến lược học sim-to-real. Những phát hiện của chúng tôi làm nổi bật hiệu quả của các xương sống tích chập thưa thớt và phân đoạn đối tượng dựa trên biến đổi, đồng thời nhấn mạnh vai trò bổ sung của việc tạo dữ liệu tổng hợp dựa trên mô hình và dựa trên tăng cường để học sim-to-real trong việc giảm nhu cầu chú thích. Nói chung, nghiên cứu này thu hẹp khoảng cách giữa những tiến bộ thuật toán và triển khai thực tế, cung cấp các công cụ tức thì cho các nhà nghiên cứu và lộ trình để phát triển các giải pháp học sâu tiết kiệm dữ liệu và có khả năng tổng quát hóa trong việc tạo kiểu hình thực vật 3D. Dữ liệu và mã có sẵn tại https://github.com/perrydoremi/PlantSegStudio.
SegFormer Fine-Tuning with Dropout: Advancing Hair Artifact Removal in Skin Lesion Analysis.
EN: Hair artifacts in dermoscopic images present significant challenges for accurate skin lesion analysis, potentially obscuring critical diagnostic features in dermatological assessments. This work introduces a fine-tuned SegFormer model augmented with dropout regularization to achieve precise hair mask segmentation. The proposed SegformerWithDropout architecture leverages the MiT-B2 encoder, pretrained on ImageNet, with an in-channel count of 3 and 2 output classes, incorporating a dropout probability of 0.3 in the segmentation head to prevent overfitting. Training is conducted on a specialized dataset of 500 dermoscopic skin lesion images with fine-grained hair mask annotations, employing 10-fold cross-validation, AdamW optimization with a learning rate of 0.001, and cross-entropy loss. Early stopping is applied based on validation loss, with a patience of 3 epochs and a maximum of 20 epochs per fold. Performance is evaluated using a comprehensive suite of metrics, including Intersection over Union (IoU), Dice coefficient, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). Experimental results from the cross-...
VI: Các tạo tác tóc trong ảnh soi da gây ra những thách thức đáng kể cho việc phân tích chính xác tổn thương da, có khả năng che khuất các đặc điểm chẩn đoán quan trọng trong đánh giá da liễu. Nghiên cứu này giới thiệu mô hình SegFormer được tinh chỉnh, tăng cường thêm điều chuẩn dropout để đạt được phân đoạn mặt nạ tóc chính xác. Kiến trúc SegformerWithDropout được đề xuất tận dụng bộ mã hóa MiT-B2, được huấn luyện trước trên ImageNet, với số lượng kênh đầu vào là 3 và 2 lớp đầu ra, kết hợp xác suất dropout là 0.3 trong phần đầu phân đoạn để ngăn ngừa overfitting. Quá trình huấn luyện được thực hiện trên tập dữ liệu chuyên biệt gồm 500 ảnh soi da tổn thương da với chú thích mặt nạ tóc chi tiết, sử dụng cross-validation 10 lần, tối ưu hóa AdamW với tốc độ học 0.001 và hàm mất mát cross-entropy. Early stopping được áp dụng dựa trên mất mát xác thực, với độ kiên nhẫn là 3 epochs và tối đa 20 epochs cho mỗi fold. Hiệu suất được đánh giá bằng một bộ chỉ số toàn diện, bao gồm Intersection over Union (IoU), Dice coefficient, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) và Learned Perceptual Image Patch Similarity (LPIPS). Kết quả thử nghiệm từ cross-validation cho thấy hiệu suất mạnh mẽ, với hệ số Dice trung bình đạt khoảng 0.96 và giá trị IoU là 0.93, cùng với PSNR thuận lợi (khoảng 34 dB), SSIM (0.97) và LPIPS thấp (0.06), làm nổi bật hiệu quả của mô hình trong việc phân đoạn chính xác tạo tác tóc và tiềm năng của nó trong việc tăng cường tiền xử lý cho các tác vụ phát hiện ung thư da ở hạ nguồn.
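For readers less familiar with the overlap metrics quoted in this entry, here is a minimal sketch (not the authors' evaluation code) of how IoU and the Dice coefficient can be computed for a binary hair mask. The function name `overlap_metrics` and the toy masks are hypothetical. For a single mask pair the two metrics are linked by Dice = 2*IoU / (1 + IoU), so an IoU of 0.93 corresponds to a Dice of roughly 0.96, which is in line with the reported averages.

```python
def overlap_metrics(pred, target):
    """Return (iou, dice) for two equally sized binary masks (nested lists of 0/1)."""
    inter = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p & t          # pixel is hair in both masks
            pred_sum += p
            target_sum += t
    union = pred_sum + target_sum - inter
    if union == 0:
        # Both masks empty: treated as perfect agreement (a convention, not a rule).
        return 1.0, 1.0
    iou = inter / union
    dice = 2 * inter / (pred_sum + target_sum)
    return iou, dice


# Toy 3x4 masks (hypothetical): 1 = hair pixel, 0 = background.
pred = [[1, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
target = [[1, 1, 0, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]

iou, dice = overlap_metrics(pred, target)
print(f"IoU={iou:.3f} Dice={dice:.3f}")
```

Note that the identity holds per image; averages taken over many images or folds need not satisfy it exactly.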
Prior-Guided Flow Matching for Target-Aware Molecule Design with Learnable Atom Number.
EN: Structure-based drug design (SBDD), aiming to generate 3D molecules with high binding affinity toward target proteins, is a vital approach in novel drug discovery. Although recent generative models have shown great potential, they suffer from unstable probability dynamics and a mismatch between generated molecule size and the protein pocket's geometry, resulting in inconsistent quality and off-target effects. We propose PAFlow, a novel target-aware molecular generation model featuring prior interaction guidance and a learnable atom number predictor. PAFlow adopts the efficient flow matching framework to model the generation process and constructs a new form of conditional flow matching for discrete atom types. A protein-ligand interaction predictor is incorporated to guide the vector field toward higher-affinity regions during generation, while an atom number predictor based on protein pocket information is designed to better align generated molecule size with target geometry. Extensive experiments on the CrossDocked2020 benchmark show that PAFlow achieves a new state-of-the-art in binding affinity (up to -8.31 Avg. Vina Score) while simultaneously maintaining favorable molecular properties.
VI: Thiết kế thuốc dựa trên cấu trúc (SBDD), với mục tiêu tạo ra các phân tử 3D có ái lực liên kết cao với protein đích, là một phương pháp quan trọng trong việc khám phá thuốc mới. Mặc dù các mô hình sinh gần đây đã cho thấy tiềm năng lớn, nhưng chúng gặp phải động lực học xác suất không ổn định và sự không phù hợp giữa kích thước phân tử được tạo ra và hình học của các túi protein, dẫn đến chất lượng không nhất quán và các tác dụng ngoài mục tiêu. Chúng tôi đề xuất PAFlow, một mô hình tạo phân tử nhận biết mục tiêu mới, có hướng dẫn tương tác tiên nghiệm và bộ dự đoán số lượng nguyên tử có thể học được. PAFlow áp dụng khung khớp dòng hiệu quả để mô hình hóa quá trình tạo và xây dựng một dạng khớp dòng có điều kiện mới cho các loại nguyên tử rời rạc. Một bộ dự đoán tương tác protein-ligand được tích hợp để hướng dẫn trường vectơ đến các vùng có ái lực cao hơn trong quá trình tạo, trong khi một bộ dự đoán số lượng nguyên tử dựa trên thông tin túi protein được thiết kế để căn chỉnh kích thước phân tử được tạo ra với hình học mục tiêu tốt hơn. Các thử nghiệm mở rộng trên bộ dữ liệu chuẩn CrossDocked2020 cho thấy PAFlow đạt được trạng thái hiện đại mới về ái lực liên kết (lên đến -8,31 Avg. Vina Score), đồng thời duy trì các đặc tính phân tử thuận lợi.
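As a rough illustration of the flow matching idea named in this entry (not PAFlow's actual model, which also handles discrete atom types and interaction guidance), the sketch below uses the commonly used straight-line conditional path x_t = (1 - t)*x0 + t*x1, whose regression target is the constant velocity x1 - x0. All function names and the toy 3D coordinates are hypothetical; in practice a trained network would stand in for the exact velocity used here.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the vector field along the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x0 = [0.0, 0.0, 0.0]    # toy "noise" position of one atom (hypothetical)
x1 = [1.5, -2.0, 0.5]   # toy "data" position of the same atom (hypothetical)

v = target_velocity(x0, x1)
midpoint = interpolate(x0, x1, 0.5)
# With the exact velocity, Euler integration carries x0 onto x1.
x_end = euler_sample(x0, lambda x, t: v)
print(midpoint, x_end)
```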
CaresAI at BioCreative IX Track 1 -- LLM for Biomedical QA.
EN: Large language models (LLMs) increasingly demonstrate accurate question answering across various domains. However, rigorous evaluation of their performance on complex question-answering (QA) tasks is essential before deployment in real-world biomedical and healthcare applications. This paper presents our approach to the MedHopQA track of the BioCreative IX shared task, which focuses on multi-hop biomedical question answering involving diseases, genes, and chemicals. We adopt a supervised fine-tuning strategy leveraging LLaMA 3 8B, enhanced with a curated biomedical question-answer dataset compiled from external sources including BioASQ, MedQuAD, and TREC. Three experimental setups are explored: fine-tuning on combined short and long answers, short answers only, and long answers only. While our models demonstrate strong domain understanding, achieving concept-level accuracy scores of up to 0.8, their Exact Match (EM) scores remain significantly lower, particularly in the test phase. We introduce a two-stage inference pipeline for precise short-answer extraction to mitigate verbosity and improve alignment with evaluation metrics. Despite partial improvements, challenges pe...
VI: Các mô hình ngôn ngữ lớn (LLM) ngày càng chứng tỏ khả năng trả lời câu hỏi chính xác trong nhiều lĩnh vực khác nhau. Tuy nhiên, việc đánh giá nghiêm ngặt hiệu suất của chúng về khả năng trả lời câu hỏi phức tạp (QA) là rất cần thiết trước khi triển khai trong các ứng dụng y sinh và chăm sóc sức khỏe thực tế. Bài báo này trình bày cách tiếp cận của chúng tôi đối với track MedHopQA của nhiệm vụ chia sẻ BioCreative IX, tập trung vào trả lời câu hỏi y sinh đa bước liên quan đến bệnh tật, gen và hóa chất. Chúng tôi áp dụng chiến lược tinh chỉnh có giám sát tận dụng LLaMA 3 8B, được tăng cường bằng một bộ dữ liệu hỏi đáp y sinh được tuyển chọn từ các nguồn bên ngoài bao gồm BioASQ, MedQuAD và TREC. Ba thiết lập thử nghiệm được khám phá: tinh chỉnh trên câu trả lời ngắn và dài kết hợp, chỉ câu trả lời ngắn và chỉ câu trả lời dài. Mặc dù các mô hình của chúng tôi thể hiện sự hiểu biết sâu sắc về lĩnh vực này, đạt được điểm độ chính xác cấp khái niệm lên đến 0,8, nhưng điểm Exact Match (EM) của chúng vẫn thấp hơn đáng kể, đặc biệt là trong giai đoạn thử nghiệm. Chúng tôi giới thiệu một quy trình suy luận hai giai đoạn để trích xuất câu trả lời ngắn chính xác nhằm giảm thiểu sự dài dòng và cải thiện sự phù hợp với các chỉ số đánh giá. Mặc dù có những cải tiến một phần, nhưng những thách thức vẫn tồn tại trong việc tạo ra các đầu ra được định dạng nghiêm ngặt. Những phát hiện của chúng tôi làm nổi bật khoảng cách giữa sự hiểu biết ngữ nghĩa và đánh giá câu trả lời chính xác trong các ứng dụng LLM y sinh, thúc đẩy nghiên cứu sâu hơn về kiểm soát đầu ra và các chiến lược hậu xử lý.
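The gap this entry describes between concept-level accuracy and Exact Match can be illustrated with a small sketch. The normalization below (lowercasing, stripping punctuation and English articles) is an assumption borrowed from common QA scoring practice, not the official BioCreative IX scorer, and the example answers are invented for illustration only.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """Strict metric: the normalized strings must be identical."""
    return normalize(prediction) == normalize(gold)


def contains_answer(prediction, gold):
    """Loose check: the gold answer appears as a whole phrase in the prediction."""
    return f" {normalize(gold)} " in f" {normalize(prediction)} "


gold = "BRCA1"                      # invented gold answer
short = "BRCA1."                    # concise model output
verbose = "The gene most strongly associated with this condition is BRCA1."

print(exact_match(short, gold), exact_match(verbose, gold), contains_answer(verbose, gold))
```

The verbose output names the right concept yet fails the strict comparison, which is the kind of mismatch a short-answer extraction stage is meant to reduce.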
Controllable 3D Molecular Generation for Structure-Based Drug Design Through Bayesian Flow Networks and Gradient Integration.
EN: Recent advances in Structure-based Drug Design (SBDD) have leveraged generative models for 3D molecular generation, predominantly evaluating model performance by binding affinity to target proteins. However, practical drug discovery necessitates high binding affinity along with synthetic feasibility and selectivity, critical properties that were largely neglected in previous evaluations. To address this gap, we identify fundamental limitations of conventional diffusion-based generative models in effectively guiding molecule generation toward these diverse pharmacological properties. We propose CByG, a novel framework extending Bayesian Flow Network into a gradient-based conditional generative model that robustly integrates property-specific guidance. Additionally, we introduce a comprehensive evaluation scheme incorporating practical benchmarks for binding affinity, synthetic feasibility, and selectivity, overcoming the limitations of conventional evaluation methods. Extensive experiments demonstrate that our proposed CByG framework significantly outperforms baseline models across multiple essential evaluation criteria, highlighting its effectiveness and practicality for real-world...
In-vitro Anti-bacterial Activity of Methanol and Aqueous Crude Extracts of Horsfieldia iryaghedhi.
EN: Aims: Over the past two decades, the rise of multidrug resistance (MDR) in bacteria has posed a significant threat to global health. The urgent need for new treatment alternatives has brought attention to the potential of plants, which harbor a wealth of unexplored phytochemicals with therapeutic properties. This study aims to evaluate the anti-bacterial efficacy of methanol and aqueous extracts from the leaves and bark of Horsfieldia iryaghedhi in vitro. Methodology: Aqueous and methanol extracts were obtained by the cold maceration method. In vitro anti-bacterial activity of methanol and aqueous leaf, bark, and combination extracts was determined against gram-negative bacteria Escherichia coli (ATCC 25922) and gram-positive bacteria Staphylococcus aureus (ATCC 25923). The anti-bacterial assay for different concentrations of each extract was conducted through the well-diffusion method, with gentamicin serving as the positive control. Results: Methanol leaf and combination extracts of Horsfieldia iryaghedhi have shown a positive anti-bacterial response at their highest concentrations of 1000 mcg/mL and 500 mcg/mL against gram-positive bacteria Staphylococcus aureus while none of the...
Predicting Drug-Drug Interactions Using Heterogeneous Graph Neural Networks: HGNN-DDI.
EN: Drug-drug interactions (DDIs) are a major concern in clinical practice, as they can lead to reduced therapeutic efficacy or severe adverse effects. Traditional computational approaches often struggle to capture the complex relationships among drugs, targets, and biological entities. In this work, we propose HGNN-DDI, a heterogeneous graph neural network model designed to predict potential DDIs by integrating multiple drug-related data sources. HGNN-DDI leverages graph representation learning to model heterogeneous biomedical networks, enabling effective information propagation across diverse node and edge types. Experimental results on benchmark DDI datasets demonstrate that HGNN-DDI outperforms state-of-the-art baselines in prediction accuracy and robustness, highlighting its potential to support safer drug development and precision medicine.
Molecular Tools for Non-Planar Surface Chemistry.
EN: Scanning probe microscopy (SPM) investigations of on-surface chemistry on passivated silicon have only shown in-plane chemical reactions, and studies on bare silicon are limited in facilitating additional reactions post-molecular-attachment. Here, we enable subsequent reactions on Si(100) through selectively adsorbing 3D, silicon-specific "molecular tools". Following an activation step, the molecules present an out-of-plane radical that can function both to donate and to accept molecular fragments, thereby enabling applications across multiple scales, e.g., macroscale customizable silicon-carbon coatings or nanoscale tip-mediated mechanosynthesis. Creation of many such molecular tools is enabled by broad molecular design criteria that facilitate reproducibility, surface specificity, and experimental verifiability. These criteria are demonstrated using a model molecular tool tetrakis(iodomethyl)germane ($Ge(CH_{2}I)_{4}$; TIMe-Ge), with experimental validation by SPM and X-ray photoelectron spectroscopy (XPS), and theoretical support by density functional theory (DFT) investigations. With this framework, a broad and diverse range of new molecular engineering capabilities are enabled on ...
ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine.
EN: Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception acro...
TOM: An Open-Source Tongue Segmentation Method with Multi-Teacher Distillation and Task-Specific Data Augmentation.
EN: Tongue imaging serves as a valuable diagnostic tool, particularly in Traditional Chinese Medicine (TCM). The quality of tongue surface segmentation significantly affects the accuracy of tongue image classification and subsequent diagnosis in intelligent tongue diagnosis systems. However, existing research on tongue image segmentation faces notable limitations, and there is a lack of robust and user-friendly segmentation tools. This paper proposes a tongue image segmentation model (TOM) based on multi-teacher knowledge distillation. By incorporating a novel diffusion-based data augmentation method, we enhanced the generalization ability of the segmentation model while reducing its parameter size. Notably, after reducing the parameter count by 96.6% compared to the teacher models, the student model still achieves an impressive segmentation performance of 95.22% mIoU. Furthermore, we packaged and deployed the trained model as both an online and offline segmentation tool (available at https://itongue.cn/), allowing TCM practitioners and researchers to use it without any programming experience. We also present a case study on TCM constitution classification using segmented tongue patche...
Ion adsorption and zeta potential of hydrophobic interfaces.
EN: Hydrophobic interfaces have unique physicochemical properties and are used in various chemical products such as food, cosmetics, soap, and medicine and technologies such as pan coating and ski wax. In this chapter, we describe the fundamental concept of hydrophobic interfaces and explain their ion adsorption and zeta potential by using experimental data from the literature. Thus far, these electrical properties are considered universal for solid/water, liquid/water, and gas/water interfaces; however, a careful comparison in this chapter will reveal significant differences among them. To confirm that the affinity of H$^+$ ions for all hydrophobic interfaces is stronger than that of OH$^-$ ions, more experimental data on hydrophobic liquid/water and solid/water interfaces are required.
Deep Learning-based QSAR Model for Therapeutic Strategies Targeting SmTGR Protein's Immune Modulating Role in Host-Parasite Interaction.
EN: Schistosomiasis, a neglected tropical disease caused by Schistosoma parasites, remains a major global health challenge. The Schistosoma mansoni thioredoxin glutathione reductase (SmTGR) is essential for parasite redox balance and immune evasion, making it a key therapeutic target. This study employs predictive Quantitative Structure-Activity Relationship (QSAR) modeling to identify potential SmTGR inhibitors. Using deep learning, a robust QSAR model was developed and validated, achieving high predictive accuracy. The predicted novel inhibitors were further validated through molecular docking studies, which demonstrated strong binding affinities, with the highest docking score of -10.76 ± 0.01 kcal/mol. Visualization of the docked structures in both 2D and 3D confirmed similar interactions for the inhibitors and commercial drugs, further supporting their therapeutic effectiveness and the predictive ability of the model. This study demonstrates the potential of QSAR modeling in accelerating drug discovery, offering a promising avenue for developing novel therapeutics targeting SmTGR to improve schistosomiasis treatment.
An Explainable AI based approach for Monitoring Animal Health.
EN: Monitoring cattle health and optimizing yield are key challenges faced by dairy farmers due to difficulties in tracking all animals on the farm. This work aims to showcase modern data-driven farming practices based on explainable machine learning (ML) methods that explain the activity and behaviour of dairy cattle (cows). Continuous data collection from 3-axis accelerometer sensors, combined with robust ML methodologies and algorithms, provides farmers and researchers with actionable information on cattle activity, allowing farmers to make informed decisions and incorporate sustainable practices. This study utilizes Bluetooth-based Internet of Things (IoT) devices and 4G networks for seamless data transmission, immediate analysis, and inference generation, and explains the models' performance with explainability frameworks. Special emphasis is put on the pre-processing of the accelerometer time-series data, including the extraction of statistical characteristics, signal processing techniques, and lag-based features using the sliding window technique. Various hyperparameter-optimized ML models are evaluated across varying window lengths for activity classification. The k-nearest neighbour Cla...
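The pre-processing described in this entry (statistical features and lag-based features over sliding windows of accelerometer data) can be sketched roughly as follows. This is a minimal illustration under assumed choices, not the study's pipeline: the window length, step, feature set, function names, and sample values are all hypothetical.

```python
from statistics import mean, pstdev


def sliding_windows(samples, window, step):
    """Yield consecutive windows of `window` samples, advancing by `step`."""
    for start in range(0, len(samples) - window + 1, step):
        yield samples[start:start + window]


def window_features(win):
    """Per-axis mean/std/min/max plus mean acceleration magnitude for one window."""
    feats = {}
    for axis, name in enumerate("xyz"):
        values = [sample[axis] for sample in win]
        feats[f"{name}_mean"] = mean(values)
        feats[f"{name}_std"] = pstdev(values)
        feats[f"{name}_min"] = min(values)
        feats[f"{name}_max"] = max(values)
    feats["mag_mean"] = mean((x * x + y * y + z * z) ** 0.5 for x, y, z in win)
    return feats


# Toy (x, y, z) accelerometer readings in g (hypothetical values).
samples = [
    (0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.2, 0.1, 0.9),
    (0.0, 0.0, 1.0), (0.3, 0.2, 0.8), (0.1, 0.0, 1.0),
]

rows = [window_features(w) for w in sliding_windows(samples, window=4, step=2)]

# Lag-based feature: carry the previous window's x-axis mean into the current row.
for prev, cur in zip(rows, rows[1:]):
    cur["x_mean_lag1"] = prev["x_mean"]

print(len(rows), rows[-1]["x_mean_lag1"])
```

Each resulting row could then be fed to a classifier such as k-nearest neighbours; varying `window` corresponds to the window-length comparison mentioned in the abstract.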
CWFBind: Geometry-Awareness for Fast and Accurate Protein-Ligand Docking.
EN: Accurately predicting the binding conformation of small-molecule ligands to protein targets is a critical step in rational drug design. Although recent deep learning-based docking surpasses traditional methods in speed and accuracy, many approaches rely on graph representations and language model-inspired encoders while neglecting critical geometric information, resulting in inaccurate pocket localization and unrealistic binding conformations. In this study, we introduce CWFBind, a weighted, fast, and accurate docking method based on local curvature features. Specifically, we integrate local curvature descriptors during the feature extraction phase to enrich the geometric representation of both proteins and ligands, complementing existing chemical, sequence, and structural features. Furthermore, we embed degree-aware weighting mechanisms into the message passing process, enhancing the model's ability to capture spatial structural distinctions and interaction strengths. To address the class imbalance challenge in pocket prediction, CWFBind employs a ligand-aware dynamic radius strategy alongside an enhanced loss function, facilitating more precise identification of binding regions a...
Quantifying the direct and indirect impact of COVID-19 vaccination: evidence from Victoria, Australia.
EN: Vaccines not only directly protect vaccinated individuals but also help protect the entire population via indirect herd-immunity benefits. However, researchers have long struggled to quantify these indirect effects at the population level, hindering assessment of vaccination program effectiveness. We developed a new method to estimate these effects, thereby markedly improving measures of the number of infections, hospitalizations, and deaths averted by vaccination. Our population-based analysis of 6,440,000 residents of Victoria, Australia, reveals strong indirect effects during the Delta outbreak (September-November 2021). By modelling a non-vaccination counterfactual, we conservatively estimate 316,000 infections were averted (95% BCI: 232k-406k), as well as 33,500 hospitalizations (95% BCI: 22.2k-46.2k), and 4,900 deaths (95% BCI: 2.9k-7.3k). These are 4.0, 7.5, and 8.0 times higher, respectively, than observed. Half of the averted infections and around one-quarter of hospitalizations and deaths were attributable to indirect protection. Homogeneous vaccination across LGAs could have reduced outcomes by approximately 25%.
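The ratios quoted in this entry can be turned into rough implied counts. The sketch below reads "N times higher than observed" as averted = N x observed, which is an interpretive assumption, and it treats "half" and "around one-quarter" as exactly 0.5 and 0.25. The resulting figures are back-of-envelope values derived from the rounded numbers in the abstract, not values reported by the study.

```python
# Point estimates of averted outcomes, as quoted in the abstract.
averted = {"infections": 316_000, "hospitalizations": 33_500, "deaths": 4_900}

# "4.0, 7.5, and 8.0 times higher, respectively, than observed"
# (assumed reading: averted = ratio * observed).
ratio_to_observed = {"infections": 4.0, "hospitalizations": 7.5, "deaths": 8.0}

# "Half" and "around one-quarter" attributable to indirect protection
# (assumed to be exactly 0.5 and 0.25 for this rough calculation).
indirect_share = {"infections": 0.5, "hospitalizations": 0.25, "deaths": 0.25}

implied_observed = {k: averted[k] / ratio_to_observed[k] for k in averted}
indirect_averted = {k: averted[k] * indirect_share[k] for k in averted}

for outcome in averted:
    print(
        f"{outcome}: implied observed ~{implied_observed[outcome]:,.0f}, "
        f"averted via indirect protection ~{indirect_averted[outcome]:,.0f}"
    )
```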
Production and spectroscopy of cold radioactive molecules.
EN: Molecules with heavy, radioactive nuclei promise extreme sensitivity to fundamental nuclear and particle physics. However, these nuclei are available in limited quantities, which challenges their use in precision measurements. Here we demonstrate the gas-phase synthesis, cryogenic cooling, and high-resolution laser spectroscopy of radium monohydroxide, monodeuteroxide, and monofluoride molecules ($^{226}$RaOH, $^{226}$RaOD, and $^{226}$RaF) in a tabletop apparatus by combining novel radioactive target production protocols, optically driven chemistry in a cryogenic buffer gas, and low-background spectroscopic detection methods. The molecules are cooled in the lab frame, creating conditions that are the same starting points as many current molecular precision measurement and quantum information experiments. This approach is readily applied to a wide range of species and establishes key capabilities for molecular quantum sensing of exotic nuclei.
Generative Artificial Intelligence Extracts Structure-Function Relationships from Plants for New Materials.
EN: Large language models (LLMs) have reshaped the research landscape by enabling new approaches to knowledge retrieval and creative ideation. Yet their application in discipline-specific experimental science, particularly in highly multi-disciplinary domains like materials science, remains limited. We present a first-of-its-kind framework that integrates generative AI with literature from hitherto-unconnected fields such as plant science, biomimetics, and materials engineering to extract insights and design experiments for materials. We focus on humidity-responsive systems such as pollen-based materials and Rhapis excelsa (broadleaf lady palm) leaves, which exhibit self-actuation and adaptive performance. Using a suite of AI tools, including a fine-tuned model (BioinspiredLLM), Retrieval-Augmented Generation (RAG), agentic systems, and a Hierarchical Sampling strategy, we extract structure-property relationships and translate them into new classes of bioinspired materials. Structured inference protocols generate and evaluate hundreds of hypotheses from a single query, surfacing novel and experimentally tractable ideas. We validate our approach through real-world implementation: LLM-ge...
Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning.
EN: Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.
SCOUT: An in-vivo Methane Sensing System for Real-time Monitoring of Enteric Emissions in Cattle with ex-vivo Validation.
EN: Accurate measurement of enteric methane emissions remains a critical bottleneck for advancing livestock sustainability through genetic selection and precision management. Existing ambient sampling approaches suffer from low data retention rates, environmental interference, and limited temporal resolution. We developed SCOUT (Smart Cannula-mounted Optical Unit for Trace-methane), the first robust in-vivo sensing system enabling continuous, high-resolution monitoring of ruminal methane concentrations through an innovative closed-loop gas recirculation design. We conducted comprehensive validation with two cannulated Simmental heifers under contrasting dietary treatments, with cross-platform comparison against established ambient sniffer systems. SCOUT achieved exceptional performance with 82% data retention compared to 17% for conventional sniffer systems, while capturing methane concentrations 100-1000x higher than ambient approaches. Cross-platform validation demonstrated strong scale-dependent correlations, with optimal correlation strength (r = -0.564 $\pm$ 0.007) at biologically relevant 40-minute windows and 100% statistical significance. High-frequency monitoring revealed nove...
TCDiff: Triplex Cascaded Diffusion for High-fidelity Multimodal EHRs Generation with Incomplete Clinical Data.
EN: The scarcity of large-scale and high-quality electronic health records (EHRs) remains a major bottleneck in biomedical research, especially as large foundation models become increasingly data-hungry. Synthesizing substantial volumes of de-identified and high-fidelity data from existing datasets has emerged as a promising solution. However, existing methods suffer from a series of limitations: they struggle to model the intrinsic properties of heterogeneous multimodal EHR data (e.g., continuous, discrete, and textual modalities), capture the complex dependencies among them, and robustly handle pervasive data incompleteness. These challenges are particularly acute in Traditional Chinese Medicine (TCM). To this end, we propose TCDiff (Triplex Cascaded Diffusion Network), a novel EHR generation framework that cascades three diffusion networks to learn the features of real-world EHR data, forming a multi-stage generative process: Reference Modalities Diffusion, Cross-Modal Bridging, and Target Modality Diffusion. Furthermore, to validate our proposed framework, besides two public datasets, we also construct and introduce TCM-SZ1, a novel multimodal EHR dataset for benchmarking. Exper...
Honey Adulteration Detection using Hyperspectral Imaging and Machine Learning.
EN: This paper aims to develop a machine learning-based system for automatically detecting honey adulteration with sugar syrup, based on honey hyperspectral imaging data. First, the floral source of a honey sample is classified by a botanical origin identification subsystem. Then, the sugar syrup adulteration is identified, and its concentration is quantified by an adulteration detection subsystem. Both subsystems consist of two steps. The first step involves extracting relevant features from the honey sample using Linear Discriminant Analysis (LDA). In the second step, we utilize the K-Nearest Neighbors (KNN) model to classify the honey botanical origin in the first subsystem and identify the adulteration level in the second subsystem. We assess the proposed system performance on a public honey hyperspectral image dataset. The result indicates that the proposed system can detect adulteration in honey with an overall cross-validation accuracy of 96.39%, making it an appropriate alternative to the current chemical-based detection methods.
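The second step of each subsystem, K-Nearest Neighbors classification, can be sketched in a few lines. This is a minimal illustration only: the LDA feature-extraction step is omitted, the two-dimensional vectors are made-up stand-ins for LDA-projected spectra, and the labels and value of k are assumptions rather than the paper's settings.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points nearest to `query` (Euclidean)."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up feature vectors (NOT from the honey dataset); they stand in for
# samples that have already been projected by a feature extractor such as LDA.
train_x = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1), (1.0, 1.1), (0.9, 1.2), (1.1, 0.9)]
train_y = ["pure", "pure", "pure", "adulterated", "adulterated", "adulterated"]

if __name__ == "__main__":
    print(knn_predict(train_x, train_y, (0.15, 0.15)))
```

The same function could serve both subsystems by changing the label set (botanical origin in the first, adulteration level in the second).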
Doctor Sun: A Bilingual Multimodal Large Language Model for Biomedical AI.
EN: Large multimodal models (LMMs) have demonstrated significant potential in providing innovative solutions for various biomedical tasks, including pathology analysis, radiology report generation, and biomedical assistance. However, the existing multimodal biomedical AI is typically based on foundation LLMs, thus hindering the understanding of intricate medical concepts with limited medical training data. Moreover, recent LLaVA-induced medical LMMs struggle to effectively capture the intricate relationship between the texts and the images. Therefore, we introduce Doctor Sun, a large multimodal generative model specialized in medicine, developed to encode, integrate, and interpret diverse biomedical data modalities such as text and images. In particular, Doctor Sun integrates a pre-trained vision encoder with a medical LLM and conducts two-stage training on various medical datasets, focusing on feature alignment and instruction tuning. Moreover, we release SunMed-VL, a wide-range bilingual medical multimodal dataset, along with all associated models, code, and resources, to freely support the advancement of biomedical multimodal research.
Classification of Honey Botanical and Geographical Sources using Mineral Profiles and Machine Learning.
EN: This paper proposes a machine learning-based approach for identifying honey floral and geographical sources using mineral element profiles. The proposed method comprises two steps: preprocessing and classification. The preprocessing phase involves missing-value treatment and data normalization. In the classification phase, we employ various supervised classification models for discriminating between six botanical sources and 13 geographical origins of honey. We test the classifiers' performance on a publicly available honey mineral element dataset. The dataset contains mineral element profiles of honeys from various floral and geographical origins. Results show that mineral element content in honey provides discriminative information useful for classifying honey botanical and geographical sources. Results also show that the Random Forests (RF) classifier obtains the best performance on this dataset, achieving a cross-validation accuracy of 99.30% for classifying honey botanical origins and 98.01% for classifying honey geographical origins.
Behavior-Specific Filtering for Enhanced Pig Behavior Classification in Precision Livestock Farming.
EN: This study proposes a behavior-specific filtering method to improve behavior classification accuracy in Precision Livestock Farming. While traditional filtering methods, such as wavelet denoising, achieved an accuracy of 91.58%, they apply uniform processing to all behaviors. In contrast, the proposed behavior-specific filtering method combines Wavelet Denoising with a Low Pass Filter, tailored to active and inactive pig behaviors, and achieved a peak accuracy of 94.73%. These results highlight the effectiveness of behavior-specific filtering in enhancing animal behavior monitoring, supporting better health management and farm efficiency.
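The core idea, choosing filter settings per behavior class instead of applying one uniform filter, can be sketched with a simple first-order low-pass filter. The wavelet-denoising stage is omitted here, and the behavior names and smoothing factors are hypothetical values chosen for illustration, not parameters from the study.

```python
def low_pass(signal, alpha):
    """First-order low-pass filter: y[n] = alpha * x[n] + (1 - alpha) * y[n-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    out = []
    prev = None
    for x in signal:
        # The first sample initializes the filter state.
        prev = x if prev is None else alpha * x + (1.0 - alpha) * prev
        out.append(prev)
    return out


# Hypothetical smoothing factors (assumption, not from the paper): lighter
# smoothing for active behaviors, heavier smoothing for inactive ones.
ALPHA_BY_BEHAVIOR = {"active": 0.8, "inactive": 0.2}


def behavior_specific_filter(signal, behavior):
    """Apply the low-pass filter with the setting assigned to `behavior`."""
    return low_pass(signal, ALPHA_BY_BEHAVIOR[behavior])


if __name__ == "__main__":
    print(behavior_specific_filter([0.0, 1.0, 0.0, 1.0], "inactive"))
```

A smaller alpha suppresses more high-frequency content, which is why the inactive setting in this sketch smooths more aggressively.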
AQUA: A Large Language Model for Aquaculture & Fisheries.
EN: Aquaculture plays a vital role in global food security and coastal economies by providing sustainable protein sources. As the industry expands to meet rising demand, it faces growing challenges such as disease outbreaks, inefficient feeding practices, rising labor costs, logistical inefficiencies, and critical hatchery issues, including high mortality rates and poor water quality control. Although artificial intelligence has made significant progress, existing machine learning methods fall short of addressing the domain-specific complexities of aquaculture. To bridge this gap, we introduce AQUA, the first large language model (LLM) tailored for aquaculture, designed to support farmers, researchers, and industry practitioners. Central to this effort is AQUADAPT (Data Acquisition, Processing and Tuning), an Agentic Framework for generating and refining high-quality synthetic data using a combination of expert knowledge, large-scale language models, and automated evaluation techniques. Our work lays the foundation for LLM-driven innovations in aquaculture research, advisory systems, and decision-making tools.
Generative molecule evolution using 3D pharmacophore for efficient Structure-Based Drug Design.
EN: Recent advances in generative models, particularly diffusion and auto-regressive models, have revolutionized fields like computer vision and natural language processing. However, their application to structure-based drug design (SBDD) remains limited due to critical data constraints. To address the limitation of training data for models targeting SBDD tasks, we propose an evolutionary framework named MEVO, which bridges the gap between the billion-scale small molecule dataset and the scarce protein-ligand complex dataset, and effectively increases the abundance of training data for generative SBDD models. MEVO is composed of three key components: a high-fidelity VQ-VAE for molecule representation in latent space, a diffusion model for pharmacophore-guided molecule generation, and a pocket-aware evolutionary strategy for molecule optimization with a physics-based scoring function. This framework efficiently generates high-affinity binders for various protein targets, validated with predicted binding affinities using free energy perturbation (FEP) methods. In addition, we showcase the capability of MEVO in designing potent inhibitors of KRAS$^{\textrm{G12D}}$, a challenging target in cancer ...
Toward Routine CSP of Pharmaceuticals: A Fully Automated Protocol Using Neural Network Potentials.
EN: Crystal structure prediction (CSP) is a useful tool in pharmaceutical development for identifying and assessing risks associated with polymorphism, yet widespread adoption has been hindered by high computational costs and the need for both manual specification and expert knowledge to achieve useful results. Here, we introduce a fully automated, high-throughput CSP protocol designed to overcome these barriers. The protocol's efficiency is driven by Lavo-NN, a novel neural network potential (NNP) architected and trained specifically for pharmaceutical crystal structure generation and ranking. This NNP-driven crystal generation phase is integrated into a scalable cloud-based workflow. We validate this CSP protocol on an extensive retrospective benchmark of 49 unique molecules, almost all of which are drug-like, successfully generating structures that match all 110 $Z' = 1$ experimental polymorphs. The average CSP in this benchmark is performed with approximately 8.4k CPU hours, which is a significant reduction compared to other protocols. The practical utility of the protocol is further demonstrated through case studies that resolve ambiguities in experimental data and a semi-blinded ...
Breaking the picomolar barrier in lateral flow assays using Bright-Dtech-614 Europium nanoparticles for enhanced sensitivity.
EN: Lateral flow immunoassays (LFIA) are among the most widely used rapid diagnostic tests for point-of-care screening of disease biomarkers. However, their limited sensitivity hinders their use in complex clinical applications that require accurate biomarker quantification for precision medicine. To address this limitation, we evaluated Bright-Dtech-614 Europium nanoparticles to enhance LFIA assay sensitivity. These nanoparticles exhibited a luminescence quantum yield of 70 % and a 90 % conjugation efficacy with antibodies by direct adsorption. Considering these properties, we developed an LFIA to quantify human lactate dehydrogenase (h-LDH), a biomarker and therapeutic target in cancer. The Bright-Dtech-614 Eu nanoparticle-based assay achieved a detection limit of 38 pg mL$^{-1}$, representing a 686-fold, 15-fold, and 2.9-fold improvement in sensitivity over conventional LFIA platforms using gold (AuNPs), carbon nanoparticles, and standard ELISA, respectively. The assay exhibited strong accuracy, with a mean recovery rate of 108 $\pm$ 11 %, and demonstrated excellent reproducibility, as evidenced by inter- and intra-batch RSD values of 4.9 % and 9.7 %, respectively, when test...
A Graph-in-Graph Learning Framework for Drug-Target Interaction Prediction.
EN: Accurately predicting drug-target interactions (DTIs) is pivotal for advancing drug discovery and target validation techniques. While machine learning approaches including those that are based on Graph Neural Networks (GNN) have achieved notable success in DTI prediction, many of them have difficulties in effectively integrating the diverse features of drugs, targets and their interactions. To address this limitation, we introduce a novel framework to take advantage of the power of both transductive learning and inductive learning so that features at molecular level and drug-target interaction network level can be exploited. Within this framework is a GNN-based model called Graph-in-Graph (GiG) that represents graphs of drug and target molecular structures as meta-nodes in a drug-target interaction graph, enabling a detailed exploration of their intricate relationships. To evaluate the proposed model, we have compiled a special benchmark comprising drug SMILES, protein sequences, and their interaction data, which is interesting in its own right. Our experimental results demonstrate that the GiG model significantly outperforms existing approaches across all evaluation metrics, highl...
Lightweight Model for Poultry Disease Detection from Fecal Images Using Multi-Color Space Feature Optimization and Machine Learning.
EN: Poultry farming is a vital component of the global food supply chain, yet it remains highly vulnerable to infectious diseases such as coccidiosis, salmonellosis, and Newcastle disease. This study proposes a lightweight machine learning-based approach to detect these diseases by analyzing poultry fecal images. We utilize multi-color space feature extraction (RGB, HSV, LAB) and explore a wide range of color, texture, and shape-based descriptors, including color histograms, local binary patterns (LBP), wavelet transforms, and edge detectors. Through a systematic ablation study and dimensionality reduction using PCA and XGBoost feature selection, we identify a compact global feature set that balances accuracy and computational efficiency. An artificial neural network (ANN) classifier trained on these features achieved 95.85% accuracy while requiring no GPU and only 638 seconds of execution time in Google Colab. Compared to deep learning models such as Xception and MobileNetV3, our proposed model offers comparable accuracy with drastically lower resource usage. This work demonstrates a cost-effective, interpretable, and scalable alternative to deep learning for real-time poultry disease...
Modeling Cholera Dynamics with Vaccination as the Control Strategy and Seasonal-forcing Transmission.
EN: This study presents a seasonally forced cholera model that incorporates imperfect vaccination as a control strategy. The model captures the temporal dynamics of susceptible, vaccinated, infected, and recovered individuals, as well as the environmental pathogen concentration. A key focus is the instantaneous reproduction number, which serves as a threshold indicator for outbreak persistence or elimination. When the reproduction number is below one, the disease-free equilibrium is attainable; otherwise, endemic conditions persist. We conduct a sensitivity analysis to evaluate the influence of two critical parameters: the vaccination rate and the waning rate of immunity. Results show that increasing the vaccination rate and reducing the waning rate significantly decrease the reproduction number, reinforcing the importance of sustained vaccine efficacy. Seasonal forcing amplifies the complexity of cholera dynamics, revealing the need for timely public health interventions, especially before high-transmission periods. This model demonstrates practical applicability in informing vaccination strategies, especially in resource-limited settings prone to seasonal outbreaks. It offers a flexible framework for pub...
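A compartmental model of this kind can be sketched as a small system of ordinary differential equations. The equations, the cosine form of the seasonal forcing, and every parameter value below are assumptions chosen for illustration (the abstract does not give the model's equations); waning is applied to vaccine-derived immunity only, and populations are expressed as fractions.

```python
import math


def beta(t, beta0=0.3, eps=0.3, period=365.0):
    """Seasonally forced transmission rate (cosine forcing; assumed form)."""
    return beta0 * (1.0 + eps * math.cos(2.0 * math.pi * t / period))


def step(state, t, dt, phi=0.01, omega=0.002, sigma=0.7,
         gamma=0.2, xi=1.0, delta=0.2, kappa=1.0):
    """One forward-Euler step for (S, V, I, R, B).

    phi: vaccination rate, omega: waning rate of vaccine immunity,
    sigma: vaccine efficacy (imperfect if < 1), gamma: recovery rate,
    xi: pathogen shedding rate, delta: pathogen decay rate,
    kappa: half-saturation constant. All values are hypothetical.
    """
    S, V, I, R, B = state
    lam = beta(t) * B / (kappa + B)  # force of infection via the environment
    dS = -lam * S - phi * S + omega * V
    dV = phi * S - (1.0 - sigma) * lam * V - omega * V
    dI = lam * S + (1.0 - sigma) * lam * V - gamma * I
    dR = gamma * I
    dB = xi * I - delta * B
    return (S + dt * dS, V + dt * dV, I + dt * dI, R + dt * dR, B + dt * dB)


def simulate(days=365, dt=0.1, state=(0.99, 0.0, 0.01, 0.0, 0.0), **params):
    """Integrate the model with fixed-step forward Euler and return the final state."""
    n_steps = int(round(days / dt))
    for n in range(n_steps):
        state = step(state, n * dt, dt, **params)
    return state


if __name__ == "__main__":
    print(simulate())
    print(simulate(phi=0.05))  # same model with a higher vaccination rate
```

Forward Euler is used only to keep the sketch dependency-free; a production analysis would more likely use an adaptive ODE solver.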
Predictive Representativity: Uncovering Racial Bias in AI-based Skin Cancer Detection.
EN: Artificial intelligence (AI) systems increasingly inform medical decision-making, yet concerns about algorithmic bias and inequitable outcomes persist, particularly for historically marginalized populations. This paper introduces the concept of Predictive Representativity (PR), a framework of fairness auditing that shifts the focus from the composition of the data set to outcomes-level equity. Through a case study in dermatology, we evaluated AI-based skin cancer classifiers trained on the widely used HAM10000 dataset and on an independent clinical dataset (BOSQUE Test set) from Colombia. Our analysis reveals substantial performance disparities by skin phototype, with classifiers consistently underperforming for individuals with darker skin, despite proportional sampling in the source data. We argue that representativity must be understood not as a static feature of datasets but as a dynamic, context-sensitive property of model predictions. PR operationalizes this shift by quantifying how reliably models generalize fairness across subpopulations and deployment contexts. We further propose an External Transportability Criterion that formalizes the thresholds for fairness generalizat...
Explainable Artificial Intelligence in Biomedical Image Analysis: A Comprehensive Survey.
EN: Explainable artificial intelligence (XAI) has become increasingly important in biomedical image analysis to promote transparency, trust, and clinical adoption of DL models. While several surveys have reviewed XAI techniques, they often lack a modality-aware perspective, overlook recent advances in multimodal and vision-language paradigms, and provide limited practical guidance. This survey addresses this gap through a comprehensive and structured synthesis of XAI methods tailored to biomedical image analysis. We systematically categorize XAI methods, analyzing their underlying principles, strengths, and limitations within biomedical contexts. A modality-centered taxonomy is proposed to align XAI methods with specific imaging types, highlighting the distinct interpretability challenges across modalities. We further examine the emerging role of multimodal learning and vision-language models in explainable biomedical AI, a topic largely underexplored in previous work. Our contributions also include a summary of widely used evaluation metrics and open-source frameworks, along with a critical discussion of persistent challenges and future directions. This survey offers a timely and in-de...
Deep-Learning-Based Pre-Layout Parasitic Capacitance Prediction on SRAM Designs.
EN: To achieve higher system energy efficiency, SRAM in SoCs is often customized. The parasitic effects cause notable discrepancies between pre-layout and post-layout circuit simulations, leading to difficulty in converging design parameters and excessive design iterations. Can the parasitics be predicted accurately from the pre-layout circuit, so that parasitic-aware pre-layout simulation becomes possible? In this work, we propose a deep-learning-based 2-stage model to accurately predict these parasitics in pre-layout stages. The model combines a Graph Neural Network (GNN) classifier and Multi-Layer Perceptron (MLP) regressors, effectively managing class imbalance of the net parasitics in SRAM circuits. We also employ Focal Loss to mitigate the impact of abundant internal net samples and integrate subcircuit information into the graph to abstract the hierarchical structure of schematics. Experiments on 4 real SRAM designs show that our approach not only surpasses the state-of-the-art model in parasitic prediction by a maximum of 19X reduction of error but also significantly boosts the simulation process by up to 598X speedup.
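Focal Loss down-weights easy, abundant samples so that rare classes contribute more to training. The sketch below shows the standard binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the default gamma and alpha are common choices and are assumptions here, since the abstract does not state the settings or whether a multi-class variant was used.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted positive-class probability p and label y in {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)      # clamp to keep log() finite
    p_t = p if y == 1 else 1.0 - p       # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Plain binary cross-entropy, for comparison."""
    return focal_loss(p, y, gamma=0.0, alpha=0.5) / 0.5


if __name__ == "__main__":
    # An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.1).
    print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

With gamma = 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy.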
Improving AI-Based Canine Heart Disease Diagnosis with Expert-Consensus Auscultation Labeling.
EN: Noisy labels pose significant challenges for AI model training in veterinary medicine. This study examines expert assessment ambiguity in canine auscultation data, highlights the negative impact of label noise on classification performance, and introduces methods for label noise reduction. To evaluate whether label noise can be minimized by incorporating multiple expert opinions, a dataset of 140 heart sound recordings (HSR) was annotated regarding the intensity of holosystolic heart murmurs caused by Myxomatous Mitral Valve Disease (MMVD). The expert opinions facilitated the selection of 70 high-quality HSR, resulting in a noise-reduced dataset. By leveraging individual heart cycles, the training data was expanded and classification robustness was enhanced. The investigation encompassed training and evaluating three classification algorithms: AdaBoost, XGBoost, and Random Forest. While AdaBoost and Random Forest exhibited reasonable performances, XGBoost demonstrated notable improvements in classification accuracy. All algorithms showed significant improvements in classification accuracy due to the applied label noise reduction, most notably XGBoost. Specifically, for the detectio...
Canine Clinical Gait Analysis for Orthopedic and Neurological Disorders: An Inertial Deep-Learning Approach.
EN: Canine gait analysis using wearable inertial sensors is gaining attention in veterinary clinical settings, as it provides valuable insights into a range of mobility impairments. Neurological and orthopedic conditions cannot always be easily distinguished even by experienced clinicians. The current study explored and developed a deep learning approach using inertial sensor readings to assess whether neurological and orthopedic gait could facilitate gait analysis. Our investigation focused on optimizing both performance and generalizability in distinguishing between these gait abnormalities. Variations in sensor configurations, assessment protocols, and enhancements to deep learning model architectures were further suggested. Using a dataset of 29 dogs, our proposed approach achieved 96% accuracy in the multiclass classification task (healthy/orthopedic/neurological) and 82% accuracy in the binary classification task (healthy/non-healthy) when generalizing to unseen dogs. Our results demonstrate the potential of inertial-based deep learning models to serve as a practical and objective diagnostic and clinical aid to differentiate gait assessment in orthopedic and neurological conditio...
FinSurvival: A Suite of Large Scale Survival Modeling Tasks from Finance.
EN: Survival modeling predicts the time until an event occurs and is widely used in risk analysis; for example, it's used in medicine to predict the survival of a patient based on censored data. There is a need for large-scale, realistic, and freely available datasets for benchmarking artificial intelligence (AI) survival models. In this paper, we derive a suite of 16 survival modeling tasks from publicly available transaction data generated by lending of cryptocurrencies in Decentralized Finance (DeFi). Each task was constructed using an automated pipeline based on choices of index and outcome events. For example, the model predicts the time from when a user borrows cryptocurrency coins (index event) until their first repayment (outcome event). We formulate a survival benchmark consisting of a suite of 16 survival-time prediction tasks (FinSurvival). We also automatically create 16 corresponding classification problems for each task by thresholding the survival time using the restricted mean survival time. With over 7.5 million records, FinSurvival provides a suite of realistic financial modeling tasks that will spur future AI survival modeling research. Our evaluation indicated that ...
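The thresholding step can be illustrated with a Kaplan-Meier estimate and the restricted mean survival time (RMST), the area under the survival curve up to a horizon tau. This is a minimal sketch on toy inputs: the function names, the horizon, and in particular the rule for censored records (label left as None when censoring occurs before the threshold) are assumptions, not the benchmark's documented pipeline.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    events[i] is 1 if the outcome was observed at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0
        removed = 0
        while i < len(data) and data[i][0] == t:  # group ties at time t
            observed += data[i][1]
            removed += 1
            i += 1
        if observed > 0:
            surv *= 1.0 - observed / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def rmst(curve, tau):
    """Area under the Kaplan-Meier step function from 0 to tau."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)


def threshold_labels(times, events, threshold):
    """1 if the record survives past the threshold, 0 if the event came earlier,
    None if it was censored before the threshold (label unknown; assumed rule)."""
    labels = []
    for t, e in zip(times, events):
        if t >= threshold:
            labels.append(1)
        elif e == 1:
            labels.append(0)
        else:
            labels.append(None)
    return labels


if __name__ == "__main__":
    times, events = [1, 2, 3, 4], [1, 0, 1, 1]  # toy data, not from FinSurvival
    cut = rmst(kaplan_meier(times, events), tau=4)
    print(cut, threshold_labels(times, events, cut))
```

With no censoring and tau equal to the largest observed time, the RMST equals the ordinary mean of the event times, which gives a quick sanity check.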
VaxPulse: Monitoring of Online Public Concerns to Enhance Post-licensure Vaccine Surveillance.
EN: The recent vaccine-related infodemic has amplified public concerns, highlighting the need for proactive misinformation management. We describe how we enhanced the reporting surveillance system of Victoria's vaccine safety service, SAEFVIC, through the incorporation of new information sources for public sentiment analysis, topics of discussion, and hesitancies about vaccinations online. Using VaxPulse, a multi-step framework, we integrate adverse events following immunisation (AEFI) with sentiment analysis, demonstrating the importance of contextualising public concerns. Additionally, we emphasise the need to address non-English languages to stratify concerns across ethno-lingual communities, providing valuable insights for vaccine uptake strategies and combating mis/disinformation. The framework is applied to real-world examples and a case study on women's vaccine hesitancy, showcasing its benefits and adaptability by identifying public opinion from online media.
Integrating Pharmacokinetics and Pharmacodynamics Modeling with Quantum Regression for Predicting Herbal Compound Toxicity.
EN: Herbal compounds present complex toxicity profiles that are often influenced by both intrinsic chemical properties and pharmacokinetics (PK) governing absorption and clearance. In this study, we develop a quantum regression model to predict acute toxicity severity for herbal-derived compounds by integrating toxicity data from NICEATM with pharmacological features from TCMSP.
PocketVina Enables Scalable and Highly Accurate Physically Valid Docking through Multi-Pocket Conditioning.
EN: Sampling physically valid ligand-binding poses remains a major challenge in molecular docking, particularly for unseen or structurally diverse targets. We introduce PocketVina, a fast and memory-efficient, search-based docking framework that combines pocket prediction with systematic multi-pocket exploration. We evaluate PocketVina across four established benchmarks--PDBbind2020 (timesplit and unseen), DockGen, Astex, and PoseBusters--and observe consistently strong performance in sampling physically valid docking poses. PocketVina achieves state-of-the-art performance when jointly considering ligand RMSD and physical validity (PB-valid), while remaining competitive with deep learning-based approaches in terms of RMSD alone, particularly on structurally diverse and previously unseen targets. PocketVina also maintains state-of-the-art physically valid docking accuracy across ligands with varying degrees of flexibility. We further introduce TargetDock-AI, a benchmarking dataset we curated, consisting of over 500000 protein-ligand pairs, and a partition of the dataset labeled with PubChem activity annotations. On this large-scale dataset, PocketVina successfully discriminates active f...
Learn to Vaccinate: Combining Structure Learning and Effective Vaccination for Epidemic and Outbreak Control.
EN: The Susceptible-Infected-Susceptible (SIS) model is a widely used model for the spread of information and infectious diseases, particularly non-immunizing ones, on a graph. Given a highly contagious disease, a natural question is how to best vaccinate individuals to minimize the disease's extinction time. While previous works showed that the problem of optimal vaccination is closely linked to the NP-hard Spectral Radius Minimization (SRM) problem, they assumed that the graph is known, which is often not the case in practice. In this work, we consider the problem of minimizing the extinction time of an outbreak modeled by an SIS model where the graph on which the disease spreads is unknown and only the infection states of the vertices are observed. To this end, we split the problem into two: learning the graph and determining effective vaccination strategies. We propose a novel inclusion-exclusion-based learning algorithm and, unlike previous approaches, establish its sample complexity for graph recovery. We then detail an optimal algorithm for the SRM problem and prove that its running time is polynomial in the number of vertices for graphs with bounded treewidth. This is complemen...
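To make the link between vaccination and spectral radius concrete, here is a small sketch that assumes the contact graph is already known and given as a symmetric adjacency matrix. It uses a greedy heuristic, not the optimal bounded-treewidth algorithm the abstract refers to, and it does not cover the graph-learning step. All function names are hypothetical.

```python
def spectral_radius(adj, iters=500):
    """Largest eigenvalue of a symmetric non-negative matrix via shifted power iteration."""
    n = len(adj)
    if n == 0:
        return 0.0
    v = [1.0 / n ** 0.5] * n
    for _ in range(iters):
        # Iterate with (A + I) so that bipartite graphs also converge.
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of A at the converged unit vector.
    av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * av[i] for i in range(n))


def remove_vertex(adj, k):
    """Adjacency matrix with vertex k (row and column) deleted."""
    n = len(adj)
    return [[adj[i][j] for j in range(n) if j != k] for i in range(n) if i != k]


def greedy_vaccinate(adj, budget):
    """Repeatedly vaccinate the vertex whose removal lowers the spectral radius most."""
    remaining = list(range(len(adj)))  # original vertex ids still in the graph
    current = adj
    chosen = []
    for _ in range(budget):
        if not remaining:
            break
        best = min(range(len(current)),
                   key=lambda k: spectral_radius(remove_vertex(current, k)))
        chosen.append(remaining.pop(best))
        current = remove_vertex(current, best)
    return chosen, spectral_radius(current)
```

On a star graph the heuristic picks the hub first, which disconnects every leaf and drives the spectral radius to zero. A greedy choice is not guaranteed to be optimal on general graphs.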
PCS Workflow for Veridical Data Science in the Age of AI.
EN: Data science is a pillar of artificial intelligence (AI), which is transforming nearly every domain of human activity, from the social and physical sciences to engineering and medicine. While data-driven findings in AI offer unprecedented power to extract insights and guide decision-making, many are difficult or impossible to replicate. A key reason for this challenge is the uncertainty introduced by the many choices made throughout the data science life cycle (DSLC). Traditional statistical frameworks often fail to account for this uncertainty. The Predictability-Computability-Stability (PCS) framework for veridical (truthful) data science offers a principled approach to addressing this challenge throughout the DSLC. This paper presents an updated and streamlined PCS workflow, tailored for practitioners and enhanced with guided use of generative AI. We include a running example to display the PCS framework in action, and conduct a related case study which showcases the uncertainty in downstream predictions caused by judgment calls in the data cleaning stage.
Q2SAR: A Quantum Multiple Kernel Learning Approach for Drug Discovery.
EN: Quantitative Structure-Activity Relationship (QSAR) modeling is a cornerstone of computational drug discovery. This research demonstrates the successful application of a Quantum Multiple Kernel Learning (QMKL) framework to enhance QSAR classification, showing a notable performance improvement over classical methods. We apply this methodology to a dataset for identifying DYRK1A kinase inhibitors. The workflow involves converting SMILES representations into numerical molecular descriptors, reducing dimensionality via Principal Component Analysis (PCA), and employing a Support Vector Machine (SVM) trained on an optimized combination of multiple quantum and classical kernels. By benchmarking the QMKL-SVM against a classical Gradient Boosting model, we show that the quantum-enhanced approach achieves a superior AUC score, highlighting its potential to provide a quantum advantage in challenging cheminformatics classification tasks.
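The kernel-combination step in the workflow above can be sketched as follows. This is only an illustration: both base kernels here are classical (linear and RBF), standing in for the quantum kernels, which would need a quantum simulator or device to evaluate. The weights are fixed by hand, whereas the abstract describes an optimized combination, and the function names are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an arbitrary illustrative value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def gram(kernel, X):
    """Gram matrix K[i][j] = kernel(X[i], X[j])."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


def combine_kernels(grams, weights):
    """Convex combination of Gram matrices; non-negative weights keep it a valid kernel."""
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    n = len(grams[0])
    return [
        [sum(w / total * G[i][j] for w, G in zip(weights, grams)) for j in range(n)]
        for i in range(n)
    ]


# Toy descriptors, e.g. two PCA components per molecule (made-up values).
X = [[1.0, 0.0], [0.0, 1.0]]
K = combine_kernels([gram(linear_kernel, X), gram(rbf_kernel, X)], [1.0, 1.0])
```

A combined Gram matrix like `K` is the kind of input an SVM that accepts precomputed kernels would be trained on.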
In Vitro Antibacterial activity of hexane, Chloroform and methanolic extracts of different parts of Acronychia pedunculata grown in Sri Lanka.
EN: This study assessed the in vitro antibacterial potential of hexane, chloroform and methanol extracts made from leaves, stem bark, flowers, seeds or roots of Sri Lankan-grown Acronychia pedunculata against two Gram positive bacteria, Staphylococcus aureus (ATCC 25923) and Bacillus cereus (ATCC 11778), and two Gram negative bacteria, Pseudomonas aeruginosa (ATCC 9027) and Escherichia coli (ATCC 35218), using the agar disc diffusion bioassay technique. The results showed that none of the extracts exhibited antibacterial action against the two Gram negative bacteria P. aeruginosa and E. coli. Conversely, compared to the reference drug, gentamicin, varying magnitudes of antibacterial activity (concentration: 300 mg/disc), ranging from zero to mild to moderate to strong, were evident for the three solvent systems made from different parts of the plant against the two Gram positive bacteria S. aureus and B. cereus. All three flower extracts exerted marked antibacterial activity against both S. aureus and B. cereus. The highest antibacterial activity was exhibited by the methanol flower extract (inhibition zone: 13.8-0.32mm), with a Minimum inhibitory value of 32mg...
InstructPro: Natural Language Guided Ligand-Binding Protein Design.
EN: Designing ligand-binding proteins with precise functions is fundamental to advances in biology and chemistry, yet existing AI approaches are limited by scarce protein-ligand complex data. Meanwhile, abundant text descriptions of protein-ligand interactions remain underutilized. We introduce InstructPro, a family of generative models that design proteins from natural language instructions and ligand formulas. InstructPro produces protein sequences consistent with specified functional descriptions and ligand targets. To enable training and evaluation, we develop InstructProBench, a large-scale dataset of 9.6 million (function description, ligand, protein) triples. We train two model variants: InstructPro-1B and InstructPro-3B, which substantially outperform strong baselines. InstructPro-1B achieves design success rates of 2.46% (seen ligands) and 3.14% (zero-shot), while InstructPro-3B reaches 5.06% and 3.93%, respectively. These results demonstrate the potential of natural language-guided generative modeling to expand protein design capabilities beyond traditional data limitations.
Smartphone-integrated RPA-CRISPR-Cas12a Detection System with Microneedle Sampling for Point-of-Care Diagnosis of Potato Late Blight in Early Stage.
EN: Potato late blight, caused by the oomycete pathogen Phytophthora infestans, is one of the most devastating diseases affecting potato crops in history. Although conventional detection methods for plant diseases such as PCR and LAMP are highly sensitive and specific, they rely on bulky and expensive laboratory equipment and involve complex operations, making them impracticable for point-of-care diagnosis in the field. In this study, we report a portable RPA-CRISPR-based diagnosis system for plant disease that integrates a smartphone for acquisition and analysis of fluorescent images. A polyvinyl alcohol (PVA) microneedle patch was employed for sample extraction from plant leaves within one minute; the DNA extraction efficiency reached 56 ug/mg, approximately 3 times that of the traditional CTAB method (18 ug/mg). The RPA-CRISPR-Cas12a isothermal assay was established to specifically target P. infestans, with no cross-reactivity observed against closely related species (P. sojae, P. capsici). The system demonstrated a detection limit of 2 pg/uL for P. infestans genomic DNA, offering sensitivity comparable to that of benchtop laboratory equipment. The system demonst...
Neural networks for the prediction of peel force for skin adhesive interface using FEM simulation.
EN: Studying the peeling behaviour of adhesives on skin is vital for advancing biomedical applications such as medical adhesives and transdermal patches. Traditional methods like experimental testing and finite element method (FEM), though considered gold standards, are resource-intensive, computationally expensive and time-consuming, particularly when analysing a wide material parameter space. In this study, we present a neural network-based approach to predict the minimum peel force (F_min) required for adhesive detachment from skin tissue, limiting the need for repeated FEM simulations and significantly reducing the computational cost. Leveraging a dataset generated from FEM simulations of 90 degree peel test with varying adhesive and fracture mechanics parameters, our neural network model achieved high accuracy, validated through rigorous 5-fold cross-validation. The final architecture was able to predict a wide variety of skin-adhesive peeling behaviour, exhibiting a mean squared error (MSE) of 3.66*10^-7 and a R^2 score of 0.94 on test set, demonstrating robust performance. This work introduces a reliable, computationally efficient method for predicting adhesive behaviour, signif...
Applying XAI based unsupervised knowledge discovering for Operation modes in a WWTP. A real case: AQUAVALL WWTP.
EN: Water reuse is a key point when fresh water is a commodity in ever greater demand, but which is also becoming ever less available. Furthermore, the return of clean water to its natural environment is also mandatory. Therefore, wastewater treatment plants (WWTPs) are essential in any policy focused on these serious challenges. WWTPs are complex facilities which need to operate at their best to achieve their goals. Nowadays, they are heavily monitored, generating large databases of historical data concerning their functioning over time. All this implies a large amount of embedded information which is not usually easy for plant managers to assimilate, correlate and understand; in other words, it is hard for them to know the global operation of the plant at any given time. At this point, intelligent and Machine Learning (ML) approaches can support that need, managing all the data and translating them into manageable, interpretable and explainable knowledge about how the WWTP is operating at a glance. Here, an eXplainable Artificial Intelligence (XAI) based methodology is proposed and tested for a real WWTP, in order to extract explainable service knowledge concerning the op...
A Gaussian process approach for rapid evaluation of skin tension.
EN: Skin tension plays a pivotal role in clinical settings: it affects scarring, wound healing and skin necrosis. Despite its importance, there is no widely accepted method for assessing in vivo skin tension or its natural pre-stretch. This study aims to utilise modern machine learning (ML) methods to develop a model that uses non-invasive measurements of surface wave speed to predict clinically useful skin properties such as stress and natural pre-stretch. A large dataset consisting of simulated wave propagation experiments was created using a simplified two-dimensional finite element (FE) model. Using this dataset, a sensitivity analysis was performed, highlighting the effect of the material parameters and material model on the Rayleigh and supersonic shear wave speeds. Then, a Gaussian process regression model was trained to solve the ill-posed inverse problem of predicting stress and pre-stretch of skin using measurements of surface wave speed. This model had good predictive performance (R^2 = 0.9570) and it was possible to interpolate simplified parametric equations to calculate the stress and pre-stretch. To demonstrate that wave speed measurements could be obtained cheaply and ea...
RACE-Align: Retrieval-Augmented and Chain-of-Thought Enhanced Preference Alignment for Large Language Models.
EN: Large Language Models (LLMs) struggle with accuracy, domain-specific reasoning, and interpretability in vertical domains. Traditional preference alignment methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) often overlook the underlying knowledge sources and reasoning logic. This paper introduces RACE-Align (Retrieval-Augmented and Chain-of-Thought Enhanced Alignment), a novel framework designed to address these limitations. RACE-Align systematically constructs a binary preference dataset incorporating external knowledge support and explicit Chain-of-Thought (CoT) reasoning, then aligns LLMs using the DPO algorithm. The core innovation lies in its preference data construction strategy: it integrates AI-driven retrieval for factual grounding, enhancing knowledgeability and accuracy, and emphasizes the optimization of domain-specific CoT, treating the reasoning process itself as a key preference dimension. A multi-stage, AI-driven refinement pipeline cost-effectively generates these preference pairs. Experimental validation in Traditional Chinese Medicine (TCM) using Qwen3-1.7B as the base model demonstrates that RACE-Align signific...
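For readers unfamiliar with the DPO step mentioned above, the standard per-pair DPO objective can be written as a short function. This is the generic loss from the DPO literature, not RACE-Align's specific training code. The argument names and the default `beta` are illustrative assumptions, and the inputs are taken to be summed token log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy matches the reference on both responses the margin is zero and the loss equals log 2. The loss decreases as the policy raises the relative likelihood of the preferred response over the rejected one.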
Analysis of in-vivo skin anisotropy using elastic wave measurements and Bayesian modelling.
EN: In vivo skin exhibits viscoelastic, hyper-elastic and non-linear characteristics. It is under a constant non-equibiaxial tension in its natural configuration and is reinforced with oriented collagen fibers, giving rise to anisotropic behaviour. Understanding the complex mechanical behaviour of skin has relevance across many sectors including pharmaceuticals, cosmetics and surgery. However, there is a dearth of quality data characterizing human skin anisotropy in vivo. The available data is usually confined to limited population groups and/or limited angular resolution. Here, we use elastic waves travelling through the skin to obtain measurements from 78 volunteers from 3 to 93 years old. Using a Bayesian framework, we analyse the effect that age, gender and level of skin tension have on the skin anisotropy and stiffness. First, we propose a new measurement of anisotropy based on the eccentricity of angular data and conclude that it is a more robust measurement compared to the classic "anisotropic ratio". We then find that in vivo skin anisotropy increases logarithmically with age, while the skin stiffness increases linearly along the direction of Langer Lines. We also conclude tha...
Case-Based Reasoning Enhances the Predictive Power of LLMs in Drug-Drug Interaction.
EN: Drug-drug interaction (DDI) prediction is critical for treatment safety. While large language models (LLMs) show promise in pharmaceutical tasks, their effectiveness in DDI prediction remains challenging. Inspired by the well-established clinical practice where physicians routinely reference similar historical cases to guide their decisions through case-based reasoning (CBR), we propose CBR-DDI, a novel framework that distills pharmacological principles from historical cases to improve LLM reasoning for DDI tasks. CBR-DDI constructs a knowledge repository by leveraging LLMs to extract pharmacological insights and graph neural networks (GNNs) to model drug associations. A hybrid retrieval mechanism and dual-layer knowledge-enhanced prompting allow LLMs to effectively retrieve and reuse relevant cases. We further introduce a representative sampling strategy for dynamic case refinement. Extensive experiments demonstrate that CBR-DDI achieves state-of-the-art performance, with a significant 28.7% accuracy improvement over both popular LLMs and CBR baseline, while maintaining high interpretability and flexibility.
Effect of Vaccine Dose Intervals: Considering Immunity Levels, Vaccine Efficacy, and Strain Variants for Disease Control Strategy.
EN: In this study, we present an immuno-epidemic model to understand mitigation options during an epidemic outbreak. The model incorporates comorbidity and multiple vaccine doses through a system of coupled integro-differential equations to analyze the epidemic rate and intensity from knowledge of the basic reproduction number and time-distributed rate functions. Our modeling results show that the interval between vaccine doses is a key control parameter that can be tuned to significantly influence disease spread. We show that multiple doses induce a hysteresis effect in immunity levels that offers a better mitigation alternative than frequent vaccination, which is less cost-effective and more intrusive. Optimal dosing intervals, which emphasize the cost-effectiveness of each vaccination effort and are determined by factors such as the level of immunity and the efficacy of vaccines against different strains, appear to be crucial in disease management. The model is sufficiently generic that it can be extended to accommodate specific disease forms.
A collaborative constrained graph diffusion model for the generation of realistic synthetic molecules.
EN: Developing new molecular compounds is crucial to address pressing challenges, from health to environmental sustainability. However, exploring the molecular space to discover new molecules is difficult due to the vastness of the space. Here we introduce CoCoGraph, a collaborative and constrained graph diffusion model capable of generating molecules that are guaranteed to be chemically valid. Thanks to the constraints built into the model and to the collaborative mechanism, CoCoGraph outperforms state-of-the-art approaches on standard benchmarks while requiring up to an order of magnitude fewer parameters. Analysis of 36 chemical properties also demonstrates that CoCoGraph generates molecules with distributions more closely matching real molecules than current models. Leveraging the model's efficiency, we created a database of 8.2 million synthetically generated molecules and conducted a Turing-like test with organic chemistry experts to further assess the plausibility of the generated molecules, and potential biases and limitations of CoCoGraph.
Multi-omic Causal Discovery using Genotypes and Gene Expression.
EN: Causal discovery in multi-omic datasets is crucial for understanding the bigger picture of gene regulatory mechanisms, but remains challenging due to high dimensionality, differentiation of direct from indirect relationships, and hidden confounders. We introduce GENESIS (GEne Network inference from Expression SIgnals and SNPs), a constraint-based algorithm that leverages the natural causal precedence of genotypes to infer ancestral relationships in transcriptomic data. Unlike traditional causal discovery methods that start with a fully connected graph, GENESIS initialises an empty ancestrality matrix and iteratively populates it with direct, indirect or non-causal relationships using a series of provably sound marginal and conditional independence tests. By integrating genotypes as fixed causal anchors, GENESIS provides a principled "head start" to classical causal discovery algorithms, restricting the search space to biologically plausible edges. We test GENESIS on synthetic and real-world genomic datasets. This framework offers a powerful avenue for uncovering causal pathways in complex traits, with promising applications to functional genomics, drug discovery, and precision me...
VET-DINO: Learning Anatomical Understanding Through Multi-View Distillation in Veterinary Imaging.
EN: Self-supervised learning has emerged as a powerful paradigm for training deep neural networks, particularly in medical imaging where labeled data is scarce. While current approaches typically rely on synthetic augmentations of single images, we propose VET-DINO, a framework that leverages a unique characteristic of medical imaging: the availability of multiple standardized views from the same study. Using a series of clinical veterinary radiographs from the same patient study, we enable models to learn view-invariant anatomical structures and develop an implied 3D understanding from 2D projections. We demonstrate our approach on a dataset of 5 million veterinary radiographs from 668,000 canine studies. Through extensive experimentation, including view synthesis and downstream task performance, we show that learning from real multi-view pairs leads to superior anatomical understanding compared to purely synthetic augmentations. VET-DINO achieves state-of-the-art performance on various veterinary imaging tasks. Our work establishes a new paradigm for self-supervised learning in medical imaging that leverages domain-specific properties rather than merely adapting natural image techniq...
Modeling the impact of control zone restrictions on pig placement in simulated African swine fever in the United States.
EN: African swine fever (ASF) is a highly contagious viral disease that poses a significant threat to the swine industry, requiring stringent control measures, including movement restrictions that delay pig placements, impacting business continuity. The number and economic impact of unplaced healthy animals due to control zone restrictions remains unmeasured. This study evaluates the economic and epidemiological impacts of control zone placement restrictions during simulated ASF outbreaks in U.S. commercial swine farms. We model the spread of ASF and apply the U.S. National Response Plan (NRP) alongside alternative mitigation strategies, analyzing key metrics such as the number of unplaced pigs, depopulated pigs, infected farms, and total economic losses. Our findings estimate the median number of unplaced pigs in the first year was 153,020 (IQR 27,377 to 1,307,899) under the NRP scenario. Shorter control zone durations (20 to 25 days) effectively reduce the median number of unplaced pigs by 16.7% to 33.5%, whereas longer durations (40 days) increase unplacement numbers by 32%. The median number of depopulated pigs remains broadly consistent across all durations. Expanding the infected...
Multi-Ligand Simultaneous Docking Analysis of Moringa Oleifera Phytochemicals Reveals Enhanced BCL-2 Inhibition via Synergistic Action.
EN: Moringa oleifera, known for its medicinal properties, contains bioactive compounds such as polyphenols and flavonoids with diverse therapeutic potentials, including anti-cancer effects. This study investigates the efficacy of M. oleifera leaf phytochemicals in inhibiting BCL-2, a critical protein involved in cancer cell survival. For the first time, multi-ligand simultaneous docking (MLSD) has been employed to understand the anti-cancer properties of M. oleifera leaf extract. Molecular docking techniques, including single-ligand and MLSD, were used to assess binding interactions with BCL-2. Single-ligand docking revealed strong binding affinities for compounds such as niazinin, alpha carotene, hesperetin, apigenin, niaziminin B, and niazimicin A, with some compounds even surpassing Venetoclax, a commercial BCL-2 inhibitor. MLSD highlighted inter-ligand interactions among apigenin, hesperetin, and niazimicin A, exhibiting a binding affinity of -14.96 kcal/mol, indicating a synergistic effect. These results shed light on the potential synergistic effects of phytochemicals when using multi-ligand simultaneous docking, underscoring the importance of considering compound interactions in...
DeepPlantCRE: A Transformer-CNN Hybrid Framework for Plant Gene Expression Modeling and Cross-Species Generalization.
EN: The investigation of plant transcriptional regulation constitutes a fundamental basis for crop breeding, where cis-regulatory elements (CREs), as the key factor determining gene expression, have become the focus of crop genetic improvement research. Deep learning techniques, leveraging their exceptional capacity for high-dimensional feature extraction and nonlinear regulatory relationship modeling, have been extensively employed in this field. However, current methodologies present notable limitations: single CNN-based architectures struggle to capture long-range regulatory interactions, while existing CNN-Transformer hybrid models are prone to overfitting and generalize poorly in cross-species prediction. To address these challenges, this study proposes DeepPlantCRE, a deep-learning framework for plant gene expression prediction and CRE extraction. The model employs a Transformer-CNN hybrid architecture that achieves higher accuracy, AUC-ROC, and F1-score than existing baselines (DeepCRE and PhytoExpr), with improved generalization and reduced overfitting. Cross-species validation experiments conducted on gene expression datase...
In Silico Prediction and Validation of LmGt Inhibitors Using QSAR and Molecular Docking Approaches.
EN: Leishmaniasis caused by Leishmania mexicana relies on Leishmania mexicana glucose transporter (LmGT) receptors, which play an important role in glucose and ribose uptake at different stages of the parasite's life cycle. Previous efforts to identify LmGT inhibitors have been primarily based on in vitro screening. However, this conventional method is limited by inefficiency, high cost, and lack of specificity, which leaves a significant gap in the development of targeted therapeutic candidates for LmGT. This study employs computational techniques to address this gap by developing a quantitative structure-activity relationship (QSAR) model, utilizing a support vector machine classifier to identify novel LmGT inhibitors. The QSAR model achieved an accuracy of 0.81 in differentiating active compounds. Molecular docking further validated the identified inhibitors, revealing strong binding affinities with a top score of -9.46. The docking analysis showed that the inhibitors formed multiple hydrogen bonds and occupied the same binding pockets as a Phase 3 drug candidate. The tested inhibitors were derived from natural sources, which suggests reduced side effects and improved biocompatibility. This combined...
Piloting Structure-Based Drug Design via Modality-Specific Optimal Schedule.
EN: Structure-Based Drug Design (SBDD) is crucial for identifying bioactive molecules. Recent deep generative models are faced with challenges in geometric structure modeling. A major bottleneck lies in the twisted probability path of multi-modalities -- continuous 3D positions and discrete 2D topologies -- which jointly determine molecular geometries. By establishing the fact that noise schedules decide the Variational Lower Bound (VLB) for the twisted probability path, we propose VLB-Optimal Scheduling (VOS) strategy in this under-explored area, which optimizes VLB as a path integral for SBDD. Our model effectively enhances molecular geometries and interaction modeling, achieving state-of-the-art PoseBusters passing rate of 95.9% on CrossDock, more than 10% improvement upon strong baselines, while maintaining high affinities and robust intramolecular validity evaluated on held-out test set. Code is available at https://github.com/AlgoMole/MolCRAFT.
A Computational Approach to Epilepsy Treatment: An AI-optimized Global Natural Product Prescription System.
EN: Epilepsy is a prevalent neurological disease with millions of patients worldwide. Many patients have turned to alternative medicine due to the limited efficacy and side effects of conventional antiepileptic drugs. In this study, we developed a computational approach to optimize herbal epilepsy treatment through AI-driven analysis of global natural products and statistically validated randomized controlled trials (RCTs). Our intelligent prescription system combines machine learning (ML) algorithms for herb-efficacy characterization, Bayesian optimization for personalized dosing, and meta-analysis of RCTs for evidence-based recommendations. The system analyzed 1,872 natural compounds from traditional Chinese medicine (TCM), Ayurveda, and ethnopharmacological databases, integrating their bioactive properties with clinical outcomes from 48 RCTs covering 48 epilepsy conditions (n=5,216). Using LASSO regression and SHAP value analysis, we identified 17 high-efficacy herbs (e.g., Gastrodia elata, Withania somnifera), showing significant seizure reduction (p < 0.01, Cohen's d=0.89) with statistical significance confirmed by multiple testing (p < 0.001). A ...
Sucrose ester surfactants: current understanding and emerging perspectives.
EN: Sucrose esters (SEs), derived from sucrose and fatty acids, are biodegradable and non-toxic surfactants increasingly favored as substitutes for petrochemically synthesized ones in food, cosmetics, and pharmaceuticals. SEs provide versatile hydrophilic-lipophilic properties, determined by the degree of sucrose esterification, which ranges from one to eight. The length of the fatty acid residues further influences the phase behavior of SEs, allowing creation of tailored formulations for specific applications. This review provides insights into our current understanding of SE phase behavior, their aggregation in aqueous and oily solutions, and the correlation of both with formulation outcomes. Furthermore, an overview of recent studies investigating SEs in various colloidal systems, including emulsions, foams, oleogels, and others, is provided. Novel concepts are discussed alongside future research directions, emphasizing the potential of SEs as sustainable, functional ingredients.
Quantum QSAR for drug discovery.
EN: Quantitative Structure-Activity Relationship (QSAR) modeling is key in drug discovery, but classical methods face limitations when handling high-dimensional data and capturing complex molecular interactions. This research proposes enhancing QSAR techniques through Quantum Support Vector Machines (QSVMs), which leverage quantum computing principles to process information in Hilbert spaces. By using quantum data encoding and quantum kernel functions, we aim to develop more accurate and efficient predictive models.
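To make the quantum-kernel idea concrete, the sketch below classically simulates a fidelity kernel under an assumed angle encoding: one qubit per feature, each prepared as RY(x_i)|0>. The abstract does not state which encoding or kernel the authors use, so the encoding choice and the function names (`angle_encode`, `fidelity_kernel`, `gram_matrix`) are illustrative assumptions. For this product-state encoding the kernel reduces to k(x, y) = prod_i cos^2((x_i - y_i)/2).

```python
import math

def angle_encode(x):
    """Assumed encoding: feature x_i -> single-qubit amplitudes of RY(x_i)|0>,
    i.e. (cos(x_i/2), sin(x_i/2))."""
    return [(math.cos(xi / 2), math.sin(xi / 2)) for xi in x]

def fidelity_kernel(x, y):
    """k(x, y) = |<phi(x)|phi(y)>|^2 for the product-state encoding above."""
    overlap = 1.0
    for (ax, bx), (ay, by) in zip(angle_encode(x), angle_encode(y)):
        overlap *= ax * ay + bx * by  # per-qubit inner product
    return overlap ** 2

def gram_matrix(samples):
    """Kernel matrix over a list of feature vectors (hypothetical descriptors)."""
    return [[fidelity_kernel(a, b) for b in samples] for a in samples]

# Hypothetical, already-scaled molecular descriptors for three compounds.
descriptors = [[0.1, 0.8], [0.2, 0.7], [2.5, 0.1]]
K = gram_matrix(descriptors)
```

A Gram matrix of this kind could then be handed to a classical SVM that accepts a precomputed kernel; on real quantum hardware the overlaps would be estimated from measurements rather than computed in closed form.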
Local Herb Identification Using Transfer Learning: A CNN-Powered Mobile Application for Nepalese Flora.
EN: Herb classification presents a critical challenge in botanical research, particularly in regions with rich biodiversity such as Nepal. This study introduces a novel deep learning approach for classifying 60 different herb species using Convolutional Neural Networks (CNNs) and transfer learning techniques. Using a manually curated dataset of 12,000 herb images, we developed a robust machine learning model that addresses existing limitations in herb recognition methodologies. Our research employed multiple model architectures, including DenseNet121, 50-layer Residual Network (ResNet50), 16-layer Visual Geometry Group Network (VGG16), InceptionV3, EfficientNetV2, and Vision Transformer (ViT), with DenseNet121 ultimately demonstrating superior performance. Data augmentation and regularization techniques were applied to mitigate overfitting and enhance the generalizability of the model. This work advances herb classification techniques, preserving traditional botanical knowledge and promoting sustainable herb utilization.
PoseX: AI Defeats Physics Approaches on Protein-Ligand Cross Docking.
EN: Existing protein-ligand docking studies typically focus on the self-docking scenario, which is less practical in real applications. Moreover, some studies involve heavy frameworks requiring extensive training, posing challenges for convenient and efficient assessment of docking methods. To fill these gaps, we design PoseX, an open-source benchmark to evaluate both self-docking and cross-docking, enabling a practical and comprehensive assessment of algorithmic advances. Specifically, first, we curated a novel dataset comprising 718 entries for self-docking and 1,312 entries for cross-docking; second, we incorporated 23 docking methods in three methodological categories, including physics-based methods (e.g., Schrödinger Glide), AI docking methods (e.g., DiffDock) and AI co-folding methods (e.g., AlphaFold3); third, we developed a relaxation method for post-processing to minimize conformational energy and refine binding poses; fourth, we built a leaderboard to rank submitted models in real time. We derived some key insights and conclusions from extensive experiments: (1) AI approaches have consistently outperformed physics-based methods in overall docking success rate. (2) Most intra- and i...
T-REX: Vision-Based System for Autonomous Leaf Detection and Grasp Estimation.
EN: T-Rex (The Robot for Extracting Leaf Samples) is a gantry-based robotic system developed for autonomous leaf localization, selection, and grasping in greenhouse environments. The system integrates a 6-degree-of-freedom manipulator with a stereo vision pipeline to identify and interact with target leaves. YOLOv8 is used for real-time leaf segmentation, and RAFT-Stereo provides dense depth maps, allowing the reconstruction of 3D leaf masks. These observations are processed through a leaf grasping algorithm that selects the optimal leaf based on clutter, visibility, and distance, and determines a grasp point by analyzing local surface flatness, top-down approachability, and margin from edges. The selected grasp point guides a trajectory executed by ROS-based motion controllers, driving a custom microneedle-equipped end-effector to clamp the leaf and simulate tissue sampling. Experiments conducted with artificial plants under varied poses demonstrate that the T-Rex system can consistently detect, plan, and perform physical interactions with plant-like targets, achieving a grasp success rate of 66.6%. This paper presents the system architecture, implementation, and testing of T-Rex as ...
Leveraging Partial SMILES Validation Scheme for Enhanced Drug Design in Reinforcement Learning Frameworks.
EN: SMILES-based molecule generation has emerged as a powerful approach in drug discovery. Deep reinforcement learning (RL) using large language models (LLMs) has been incorporated into the molecule generation process to achieve high matching scores in terms of the likelihood of desired molecule candidates. However, a critical challenge in this approach is catastrophic forgetting during the RL phase, where knowledge such as molecule validity, which often exceeds 99% during pretraining, significantly deteriorates. Current RL algorithms applied in drug discovery, such as REINVENT, use prior models as anchors to retain pretraining knowledge, but these methods lack robust exploration mechanisms. To address these issues, we propose Partial SMILES Validation-PPO (PSV-PPO), a novel RL algorithm that incorporates real-time partial SMILES validation to prevent catastrophic forgetting while encouraging exploration. Unlike traditional RL approaches that validate molecule structures only after generating entire sequences, PSV-PPO performs stepwise validation at each auto-regressive step, evaluating not only the selected token candidate but also all potential branches stemming from the prior partial seque...
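The stepwise idea of rejecting a token as soon as the partial sequence can no longer become a valid SMILES string can be illustrated with a toy prefix check. This is not the PSV-PPO validator: it only tracks branch parentheses and bracket atoms, ignores ring closures, valence, and aromaticity, and both function names are hypothetical.

```python
def partial_smiles_ok(prefix):
    """Toy check: return False once a prefix cannot be completed into
    syntactically valid SMILES. Only parentheses and brackets are tracked."""
    depth = 0           # number of currently open branches "("
    in_bracket = False  # True while inside a bracket atom "[...]"
    prev = ""
    for ch in prefix:
        if in_bracket:
            if ch == "[":
                return False          # bracket atoms cannot nest
            if ch == "]":
                in_bracket = False
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            return False              # closing a bracket that was never opened
        elif ch == "(":
            if prev in ("", "("):
                return False          # a branch must follow an atom
            depth += 1
        elif ch == ")":
            if depth == 0 or prev == "(":
                return False          # unmatched close or empty branch
            depth -= 1
        prev = ch
    return True

def allowed_next_tokens(prefix, vocabulary):
    """Filter candidate tokens at one autoregressive step (single-character
    tokens assumed for simplicity)."""
    return [tok for tok in vocabulary if partial_smiles_ok(prefix + tok)]

candidates = allowed_next_tokens("CC(=O", ["C", ")", "(", "]"])
```

In an RL loop, a mask built this way could be applied to the policy's token distribution before sampling, so that invalid branches are penalized or pruned at the step where they arise rather than after the full sequence is generated.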
Node2Vec-DGI-EL: A Hierarchical Graph Representation Learning Model for Ingredient-Disease Association Prediction.
EN: Traditional Chinese medicine, as an essential component of traditional medicine, contains active ingredients that serve as a crucial source for modern drug development, holding immense therapeutic potential and development value. A multi-layered and complex network is formed from Chinese medicine to diseases and used to predict the potential associations between Chinese medicine ingredients and diseases. This study proposes an ingredient-disease association prediction model (Node2Vec-DGI-EL) based on hierarchical graph representation learning. First, the model uses the Node2Vec algorithm to extract node embedding vectors from the network as the initial features of the nodes. Next, the network nodes are deeply represented and learned using the DGI algorithm to enhance the model's expressive power. To improve prediction accuracy and robustness, an ensemble learning method is incorporated to achieve more accurate ingredient-disease association predictions. The effectiveness of the model is then evaluated through a series of theoretical verifications. The results demonstrated that the proposed model significantly outperformed existing methods, achieving an AUC of 0.9987 and an AUPR of ...
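The first stage described above, Node2Vec feature extraction, rests on biased second-order random walks over the network. The sketch below shows one such walk on a tiny, entirely hypothetical ingredient-target-disease graph; the node names, the parameter values, and the function name `node2vec_walk` are assumptions for illustration, and the skip-gram training that turns walks into embeddings is omitted.

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """One biased random walk on an unweighted, undirected graph.
    p is the return parameter, q the in-out parameter."""
    rng = rng or random.Random(0)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))  # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)    # step back to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)        # stay within distance 1 of prev
            else:
                weights.append(1.0 / q)    # move away from prev
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Hypothetical toy network (symmetric adjacency sets).
adj = {
    "ingredient_A": {"target_1"},
    "ingredient_B": {"target_1", "target_2"},
    "target_1": {"ingredient_A", "ingredient_B", "disease_X"},
    "target_2": {"ingredient_B", "disease_X"},
    "disease_X": {"target_1", "target_2"},
}
walk = node2vec_walk(adj, "ingredient_A", 6, p=0.5, q=2.0, rng=random.Random(42))
```

Small q favors exploration away from the previous node, while small p keeps the walk local; many such walks per node would normally be collected before training embeddings.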
Pan-genome Analysis of Angiosperm Plastomes using PGR-TK.
EN: We present a novel approach for taxonomic analysis of chloroplast genomes in angiosperms using the Pan-genome Research Toolkit (PGR-TK). Comparative plots generated by PGR-TK across diverse angiosperm genera reveal a wide range of structural complexity, from straightforward to highly intricate patterns. Notably, the characteristic quadripartite plastome structure, comprising the large single copy (LSC), small single copy (SSC), and inverted repeat (IR) regions, is clearly identifiable in over 75% of the genera analyzed. Our findings also underscore several occurrences of species mis-annotations in public genomic databases, which are readily detected through visual anomalies in the PGR-TK plots. While more complex plot patterns remain difficult to interpret, they likely reflect underlying biological variation or technical inconsistencies in genome assembly. Overall, this approach effectively integrates classical botanical visualization with modern molecular taxonomy, providing a powerful tool for genome-based classification in plant systematics.
Model uncertainty quantification using feature confidence sets for outcome excursions.
EN: When implementing prediction models for high-stakes real-world applications such as medicine, finance, and autonomous systems, quantifying prediction uncertainty is critical for effective risk management. Traditional approaches to uncertainty quantification, such as confidence and prediction intervals, provide probability coverage guarantees for the expected outcomes $f(\boldsymbol{x})$ or the realized outcomes $f(\boldsymbol{x})+\epsilon$. Instead, this paper introduces a novel, model-agnostic framework for quantifying uncertainty in continuous and binary outcomes using confidence sets for outcome excursions, where the goal is to identify a subset of the feature space where the expected or realized outcome exceeds a specific value. The proposed method constructs data-dependent inner and outer confidence sets that aim to contain the true feature subset for which the expected or realized outcomes of these features exceed a specified threshold. We establish theoretical guarantees for the probability that these confidence sets contain the true feature subset, both asymptotically and for finite sample sizes. The framework is validated through simulations and applied to real-world datasets, de...
Direct Video-Based Spatiotemporal Deep Learning for Cattle Lameness Detection.
EN: Cattle lameness is a prevalent health problem in livestock farming, often resulting from hoof injuries or infections, and severely impacts animal welfare and productivity. Early and accurate detection is critical for minimizing economic losses and ensuring proper treatment. This study proposes a spatiotemporal deep learning framework for automated cattle lameness detection using publicly available video data. We curate and publicly release a balanced set of 50 online video clips featuring 42 individual cattle, recorded from multiple viewpoints in both indoor and outdoor environments. The videos were categorized into lame and non-lame classes based on visual gait characteristics and metadata descriptions. After applying data augmentation techniques to enhance generalization, two deep learning architectures were trained and evaluated: 3D Convolutional Neural Networks (3D CNN) and Convolutional Long-Short-Term Memory (ConvLSTM2D). The 3D CNN achieved a video-level classification accuracy of 90%, with a precision, recall, and F1 score of 90.9% each, outperforming the ConvLSTM2D model, which achieved 85% accuracy. Unlike conventional approaches that rely on multistage pipelines involvin...
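The reported metrics (90% accuracy with precision, recall, and F1 all at 90.9%) can be related to one another through a confusion matrix. The counts below are purely hypothetical: the abstract does not give the test-set size or the actual confusion matrix, and these numbers were chosen only because they happen to reproduce the same percentages.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts (not from the paper), with "lame" as the positive class.
acc, prec, rec, f1 = binary_metrics(tp=10, fp=1, fn=1, tn=8)
print(f"accuracy={acc:.1%} precision={prec:.1%} recall={rec:.1%} f1={f1:.1%}")
```

Whenever precision and recall are equal, F1 equals them as well, which is why a single value can be quoted for all three.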
Audio-Visual Class-Incremental Learning for Fish Feeding intensity Assessment in Aquaculture.
EN: Fish Feeding Intensity Assessment (FFIA) is crucial in industrial aquaculture management. Recent multi-modal approaches have shown promise in improving FFIA robustness and efficiency. However, these methods face significant challenges when adapting to new fish species or environments due to catastrophic forgetting and the lack of suitable datasets. To address these limitations, we first introduce AV-CIL-FFIA, a new dataset comprising 81,932 labelled audio-visual clips capturing feeding intensities across six different fish species in real aquaculture environments. Then, we pioneer audio-visual class incremental learning (CIL) for FFIA and demonstrate through benchmarking on AV-CIL-FFIA that it significantly outperforms single-modality methods. Existing CIL methods rely heavily on historical data. Exemplar-based approaches store raw samples, creating storage challenges, while exemplar-free methods avoid data storage but struggle to distinguish subtle feeding intensity variations across different fish species. To overcome these limitations, we introduce HAIL-FFIA, a novel audio-visual class-incremental learning framework that bridges this gap with a prototype-based approach that achi...
p2smi: A Python Toolkit for Peptide FASTA-to-SMILES Conversion and Molecular Property Analysis.
EN: Converting peptide sequences into useful representations for downstream analysis is a common step in computational modeling and cheminformatics. Furthermore, peptide drugs (e.g., Semaglutide, Degarelix) often take advantage of the diverse chemistries found in noncanonical amino acids (NCAAs), altered stereochemistry, and backbone modifications. Despite there being several chemoinformatics toolkits, none are tailored to the task of converting a modified peptide from an amino acid representation to the chemical string nomenclature Simplified Molecular-Input Line-Entry System (SMILES), often used in chemical modeling. Here we present p2smi, a Python toolkit with CLI, designed to facilitate the conversion of peptide sequences into chemical SMILES strings. By supporting both cyclic and linear peptides, including those with NCAAs, p2smi enables researchers to generate accurate SMILES strings for drug-like peptides, reducing the overhead for computational modeling and cheminformatics analyses. The toolkit also offers functionalities for chemical modification, synthesis feasibility evaluation, and calculation of molecular properties such as hydrophobicity, topological polar surface area, m...
Robustness and sex differences in skin cancer detection: logistic regression vs CNNs.
EN: Deep learning has been reported to achieve high performance in the detection of skin cancer, yet many challenges regarding the reproducibility of results and biases remain. This study is a replication (different data, same analysis) of a previous study on Alzheimer's disease detection, which studied the robustness of logistic regression (LR) and convolutional neural networks (CNN) across patient sexes. We explore sex bias in skin cancer detection, using the PAD-UFES-20 dataset with LR trained on handcrafted features reflecting dermatological guidelines (ABCDE and the 7-point checklist), and a pre-trained ResNet-50 model. We evaluate these models in alignment with the replicated study: across multiple training datasets with varied sex composition to determine their robustness. Our results show that both the LR and the CNN were robust to the sex distribution, but the results also revealed that the CNN had a significantly higher accuracy (ACC) and area under the receiver operating characteristic curve (AUROC) for male patients compared to female patients. The data and relevant scripts to reproduce our results are publicly available (https://github.com/nikodice4/Skin-cancer-detection-sex-...
Worm-like emulsion droplets.
EN: Forming an interface between immiscible fluids incurs a free-energy cost that usually favors minimizing the interfacial area. An emulsion droplet of fixed volume therefore tends to form a sphere, and pairs of droplets tend to coalesce. Surfactant molecules adsorbed to the droplets' surfaces stabilize emulsions by providing a kinetic barrier to coalescence. Here, we show that the bound surfactants' osmotic pressure also competes with the droplet's intrinsic surface tension and can reverse the sign of the overall surface free energy. The onset of negative surface tension favors maximizing surface area and therefore favors elongation into a worm-like morphology. Analyzing this system in the Gibbs grand canonical ensemble reveals a phase transition between spherical and worm-like emulsions that is governed by the chemical potential of surfactant molecules in solution. Predictions based on this model agree with the observed behavior of an experimental model system composed of lipid-stabilized silicone oil droplets in an aqueous surfactant solution.
Opening the Black-Box: Symbolic Regression with Kolmogorov-Arnold Networks for Energy Applications.
EN: While most modern machine learning methods offer speed and accuracy, few promise interpretability or explainability -- two key features necessary for highly sensitive industries, like medicine, finance, and engineering. Using eight datasets representative of one especially sensitive industry, nuclear power, this work compares a traditional feedforward neural network (FNN) to a Kolmogorov-Arnold Network (KAN). We consider not only model performance and accuracy, but also interpretability through model architecture and explainability through a post-hoc SHAP analysis. In terms of accuracy, we find KANs and FNNs comparable across all datasets, when output dimensionality is limited. KANs, which transform into symbolic equations after training, yield perfectly interpretable models while FNNs remain black-boxes. Finally, using the post-hoc explainability results from Kernel SHAP, we find that KANs learn real, physical relations from experimental data, while FNNs simply produce statistically accurate results. Overall, this analysis finds KANs a promising alternative to traditional machine learning methods, particularly in applications requiring both accuracy and comprehensibility.
Surface forces and frictional properties of adsorbed bio-based cationic polysaccharide thin films in salted aqueous medium.
EN: Inter-surface forces mediated by polymer films are important in a range of technological and industrial situations. In cosmetics, applications such as hair conditioning typically rely on the adsorption of polyelectrolyte films onto the charged surface of hair fibers, whose contact mechanics and tribological properties are central in determining the final sensorial perceptions associated with the cosmetic treatment. A major current challenge to be tackled by the cosmetic industry is to design high-performance products employing bio-sourced polyelectrolytes, with the aim of achieving eco-sustainable processes and products. In this context, the present study focuses on the mechanical properties of thin films obtained by adsorption from solution of fungal chitosan onto negatively charged mica surfaces. We use a Surface Forces Apparatus allowing for the simultaneous measurement of film thickness and friction force as a function of the applied normal load and shear velocity. We show that, in aqueous medium at an ionic strength of 40 mM, adsorbed films of chitosan give rise to repulsive inter-surface forces whose range, comparable to the Flory radius of the macromolecules, increases with ...
Addressing Model Overcomplexity in Drug-Drug Interaction Prediction With Molecular Fingerprints.
EN: Accurately predicting drug-drug interactions (DDIs) is crucial for pharmaceutical research and clinical safety. Recent deep learning models often suffer from high computational costs and limited generalization across datasets. In this study, we investigate a simpler yet effective approach using molecular representations such as Morgan fingerprints (MFPS), graph-based embeddings from graph convolutional networks (GCNs), and transformer-derived embeddings from MoLFormer integrated into a straightforward neural network. We benchmark our implementation on DrugBank DDI splits and a drug-drug affinity (DDA) dataset from the Food and Drug Administration. MFPS along with MoLFormer and GCN representations achieve competitive performance across tasks, even in the more challenging leak-proof split, highlighting the sufficiency of simple molecular representations. Moreover, we are able to identify key molecular motifs and structural patterns relevant to drug interactions via gradient-based analyses using the representations under study. Despite these results, dataset limitations such as insufficient chemical diversity, limited dataset size, and inconsistent labeling impact robust evaluation an...
Multimodal Data Integration for Sustainable Indoor Gardening: Tracking Anyplant with Time Series Foundation Model.
EN: Indoor gardening within sustainable buildings offers a transformative solution to urban food security and environmental sustainability. Urban farming, including Controlled Environment Agriculture (CEA) and vertical farming, is expected to grow at a compound annual growth rate (CAGR) of 13.2% from 2024 to 2030, according to market reports. This growth is fueled by advancements in Internet of Things (IoT) technologies, sustainable innovations such as smart growing systems, and the rising interest in green interior design. This paper presents a novel framework that integrates computer vision, machine learning (ML), and environmental sensing for the automated monitoring of plant health and growth. Unlike previous approaches, this framework combines RGB imagery, plant phenotyping data, and environmental factors such as temperature and humidity to predict plant water stress in a controlled growth environment. The system utilizes high-resolution cameras to extract phenotypic features, such as RGB, plant area, height, and width, while employing the Lag-Llama time series model to analyze and predict water stress. Experimental results demonstrate that integrating RGB, size ratios, a...
TransDiffSBDD: Causality-Aware Multi-Modal Structure-Based Drug Design.
EN: Structure-based drug design (SBDD) is a critical task in drug discovery, requiring the generation of molecular information across two distinct modalities: discrete molecular graphs and continuous 3D coordinates. However, existing SBDD methods often overlook two key challenges: (1) the multi-modal nature of this task and (2) the causal relationship between these modalities, limiting their plausibility and performance. To address both challenges, we propose TransDiffSBDD, an integrated framework combining autoregressive transformers and diffusion models for SBDD. Specifically, the autoregressive transformer models discrete molecular information, while the diffusion model samples continuous distributions, effectively resolving the first challenge. To address the second challenge, we design a hybrid-modal sequence for protein-ligand complexes that explicitly respects the causality between modalities. Experiments on the CrossDocked2020 benchmark demonstrate that TransDiffSBDD outperforms existing baselines.
Compositional Analysis of Fragrance Accords Using Femtosecond Thermal Lens Spectroscopy.
EN: Femtosecond thermal lens spectroscopy (FTLS) is a powerful analytical tool, yet its application to complex, multi-component mixtures like fragrance accords remains limited. Here, we introduce and validate a unified metric, the Femtosecond Thermal Lens Integrated Magnitude (FTL-IM), to characterize such mixtures. The FTL-IM, derived from the integrated signal area, provides a direct, model-free measure of the total thermo-optical response, including critical convective effects. Applying the FTL-IM to complex six-component accords, we demonstrate its utility in predicting a mixture's thermal response from its composition through linear additivity with respect to component mole fractions. Our method quantifies the accords' behavior, revealing both the baseline contributions of components and the dominant, non-linear effects of highly-active species like Methyl Anthranilate. This consistency is validated across single-beam Z-scan, dual-beam Z-scan, and time-resolved FTLS measurements. The metric also demonstrates the necessity of single-beam measurements for interpreting dual-beam data. This work establishes a rapid, quantitative method for fragrance analysis, offering advantages for q...
Adhesion differentials control the rheology of biomimetic emulsions.
EN: Animal morphogenesis involves complex tissue deformation processes, which require tight control over tissue rheology. Yet, it remains insufficiently understood how tissue rheology results from the interplay between cellular packing and cellular forces, such as cortical tension, cell pressure, and cell-cell adhesion. Here, we follow a biomimetic approach to study this interplay. We mimic adhesive cells with oil droplets whose adhesion strength and specificity can be flexibly tuned. Using microfluidics, we expose 2D emulsions to an oscillatory geometry imposing cyclic pure shear, and we develop a geometric method to quantify their rheology using only imaging data. We find that some of the emulsions made of two droplet types progressively change their yielding behavior across subsequent shear cycles. Combining this with vertex model simulations, we show that the observed shift in yielding behavior is due to a progressive compaction, which only occurs in emulsions with a high adhesion differential and only when exposed to oscillatory shear. Gradients of cell compaction have been observed during animal development. Our work demonstrates how such gradients can be used to control gradient...
Clarifying Misconceptions in COVID-19 Vaccine Sentiment and Stance Analysis and Their Implications for Vaccine Hesitancy Mitigation: A Systematic Review.
EN: Background: Advances in machine learning (ML) models have increased the capability of researchers to detect vaccine hesitancy in social media using Natural Language Processing (NLP). A considerable volume of research has identified the persistence of COVID-19 vaccine hesitancy in discourse shared on various social media platforms. Methods: Our objective in this study was to conduct a systematic review of research employing sentiment analysis or stance detection to study discourse towards COVID-19 vaccines and vaccination spread on Twitter (officially known as X since 2023). Following registration in the PROSPERO international registry of systematic reviews, we searched papers published from 1 January 2020 to 31 December 2023 that used supervised machine learning to assess COVID-19 vaccine hesitancy through stance detection or sentiment analysis on Twitter. We categorized the studies according to a taxonomy of five dimensions: tweet sample selection approach, self-reported study type, classification typology, annotation codebook definitions, and interpretation of results. We analyzed whether studies using stance detection report different hesitancy trends than those using sentiment analysi...
Linear to Neural Networks Regression: QSPR of Drugs via Degree-Distance Indices.
EN: This study conducts a Quantitative Structure-Property Relationship (QSPR) analysis to explore the correlation between the physical properties of drug molecules and their topological indices using machine learning techniques. While prior studies in drug design have focused on degree-based topological indices, this work analyzes a dataset of 166 drug molecules by computing degree-distance-based topological indices, incorporating vertex-edge weightings with respect to six different atomic properties (atomic number, atomic radius, atomic mass, density, electronegativity, ionization). Both linear models (Linear Regression, Lasso, and Ridge Regression) and nonlinear approaches (Random Forest, XGBoost, and Neural Networks) were employed to predict molecular properties. The results demonstrate the effectiveness of these indices in predicting specific physicochemical properties and underscore the practical relevance of computational methods in molecular property estimation. The study provides an innovative perspective on integrating topological indices with machine learning to enhance predictive accuracy, highlighting their potential application in drug discovery and development processes. ...
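As a concrete instance of a degree-distance-based index, the sketch below computes the classical degree distance index, DD(G) = sum over unordered vertex pairs {u, v} of (deg(u) + deg(v)) * d(u, v), on hydrogen-suppressed molecular graphs. It is an assumed, unweighted simplification: the atomic-property vertex-edge weightings mentioned in the abstract are not modeled, and the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in bonds) from source to every reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_distance_index(adj):
    """Classical degree distance index of a connected graph."""
    nodes = sorted(adj)
    total = 0
    for i, u in enumerate(nodes):
        dist = bfs_distances(adj, u)
        for v in nodes[i + 1:]:
            total += (len(adj[u]) + len(adj[v])) * dist[v]
    return total

# Carbon skeletons of two C4H10 isomers (hydrogens suppressed).
n_butane = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
isobutane = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
dd_linear = degree_distance_index(n_butane)     # 28
dd_branched = degree_distance_index(isobutane)  # 24
```

The two isomers share a molecular formula but receive different index values, which is the kind of structural signal a regression model can then relate to physicochemical properties.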
SMPR: A structure-enhanced multimodal drug-disease prediction model for drug repositioning and cold start.
EN: Predicting drug-disease relationships for drug repositioning has long been an active field of research. However, actual cases of biologically validated drug repositioning remain very limited, and existing models have not yet fully utilized the structural information of the drug. Furthermore, most repositioning models are only used to complete the relationship matrix, and their practicality is poor when dealing with drug cold start problems. This paper proposes a structure-enhanced multimodal relationship prediction model (SMPR). SMPR is based on the SMILES structure of the drug, using the Mol2Vec method to generate drug embedded representations, and learns disease embedded representations through heterogeneous network graph neural networks. Ultimately, a drug-disease relationship matrix is constructed. In addition, to make the model easier to use, SMPR also provides a cold start interface that uses structural similarity together with the repositioning results to predict drug-related diseases simply and quickly. The repositioning ability and cold start capability of the model are verified from multiple perspectives. While the AUC and AUPR scores of repositioning reach 99% and 61% respectively, the AUC of cold star...
Subgroup Performance Analysis in Hidden Stratifications.
EN: Machine learning (ML) models may suffer from significant performance disparities between patient groups. Identifying such disparities by monitoring performance at a granular level is crucial for safely deploying ML to each patient. Traditional subgroup analysis based on metadata can expose performance disparities only if the available metadata (e.g., patient sex) sufficiently reflects the main reasons for performance variability, which is not common. Subgroup discovery techniques that identify cohesive subgroups based on learned feature representations appear as a potential solution: They could expose hidden stratifications and provide more granular subgroup performance reports. However, subgroup discovery is challenging to evaluate even as a standalone task, as ground truth stratification labels do not exist in real data. Subgroup discovery has thus neither been applied nor evaluated for the application of subgroup performance monitoring. Here, we apply subgroup discovery for performance monitoring in chest x-ray and skin lesion classification. We propose novel evaluation strategies and show that a simplified subgroup discovery method without access to classification labels or met...
FMCHS: Advancing Traditional Chinese Medicine Herb Recommendation with Fusion of Multiscale Correlations of Herbs and Symptoms.
EN: Traditional Chinese medicine (TCM) exhibits remarkable therapeutic efficacy in disease treatment and healthcare through personalized herb prescriptions. However, current herb recommendation models inadequately capture the multiscale relations between herbs and clinical symptoms, particularly neglecting latent correlations at the chemical-molecular scale. To address these limitations, we propose the Fusion of Multiscale Correlations of Herbs and Symptoms (FMCHS), an innovative framework that synergistically integrates molecular-scale chemical characteristics of herbs with clinical symptoms. The framework employs multi-relational graph transformer layers to generate enriched embeddings that preserve both structural and semantic features within herbs and symptoms. Through systematic incorporation of herb chemical profiles into node embeddings and implementation of attention-based feature fusion, FMCHS effectively utilizes multiscale correlations. Comprehensive evaluations demonstrate FMCHS's superior performance over the state-of-the-art (SOTA) baseline, achieving relative improvements of 8.85% in Precision@5, 12.30% in Recall@5, and 10.86% in F1@5 compared to the SOTA model on benchm...
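For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute Precision@k, Recall@k, and F1@k for a ranked herb list. The function name, the herb names, and the choice of dividing precision by k are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
def precision_recall_f1_at_k(recommended, relevant, k=5):
    """Compute Precision@k, Recall@k and F1@k for one ranked recommendation list.

    recommended: ranked list of item ids (best first)
    relevant:    set of ground-truth item ids
    """
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical prescription: 2 of the top-5 herbs are in the ground truth of 4.
recommended = ["ginseng", "licorice", "ginger", "peony", "cinnamon", "astragalus"]
relevant = {"licorice", "ginger", "astragalus", "angelica"}
p5, r5, f5 = precision_recall_f1_at_k(recommended, relevant, k=5)
```

A relative improvement such as the ones quoted above is then (new score - baseline score) / baseline score.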
A Generalist Cross-Domain Molecular Learning Framework for Structure-Based Drug Discovery.
EN: Structure-based drug discovery (SBDD) is a systematic scientific process that develops new drugs by leveraging the detailed physical structure of the target protein. Recent advancements in pre-trained models for biomolecules have demonstrated remarkable success across various biochemical applications, including drug discovery and protein engineering. However, in most approaches, the pre-trained models primarily focus on the characteristics of either small molecules or proteins, without delving into their binding interactions which are essential cross-domain relationships pivotal to SBDD. To fill this gap, we propose a general-purpose foundation model named BIT (an abbreviation for Biomolecular Interaction Transformer), which is capable of encoding a range of biochemical entities, including small molecules, proteins, and protein-ligand complexes, as well as various data formats, encompassing both 2D and 3D structures. Specifically, we introduce Mixture-of-Domain-Experts (MoDE) to handle the biomolecules from diverse biochemical domains and Mixture-of-Structure-Experts (MoSE) to capture positional dependencies in the molecular structures. The proposed mixture-of-experts approach enab...
Integrating Protein Dynamics into Structure-Based Drug Design via Full-Atom Stochastic Flows.
EN: The dynamic nature of proteins, influenced by ligand interactions, is essential for comprehending protein function and progressing drug discovery. Traditional structure-based drug design (SBDD) approaches typically target binding sites with rigid structures, limiting their practical application in drug development. While molecular dynamics simulation can theoretically capture all the biologically relevant conformations, the transition rate is dictated by the intrinsic energy barrier between them, making the sampling process computationally expensive. To overcome the aforementioned challenges, we propose to use generative modeling for SBDD considering conformational changes of protein pockets. We curate a dataset of apo and multiple holo states of protein-ligand complexes, simulated by molecular dynamics, and propose a full-atom flow model (and a stochastic version), named DynamicFlow, that learns to transform apo pockets and noisy ligands into holo pockets and corresponding 3D ligand molecules. Our method uncovers promising ligand molecules and corresponding holo conformations of pockets. Additionally, the resultant holo-like states provide superior inputs for traditional SBDD appr...
To Vaccinate or not to Vaccinate? Analyzing $\mathbb{X}$ Power over the Pandemic.
EN: The COVID-19 pandemic has profoundly affected the normal course of life -- from lock-downs and virtual meetings to the unprecedentedly swift creation of vaccines. To halt the COVID-19 pandemic, the world has started preparing for the global vaccine roll-out. In an effort to navigate the immense volume of information about COVID-19, the public has turned to social networks. Among them, $\mathbb{X}$ (formerly Twitter) has played a key role in distributing related information. Most people are not trained to interpret medical research and remain skeptical about the efficacy of new vaccines. Measuring their reactions and perceptions is gaining significance in the fight against COVID-19. To assess the public perception regarding the COVID-19 vaccine, our work applies a sentiment analysis approach, using natural language processing of $\mathbb{X}$ data. We show how to use textual analytics and textual data visualization to discover early insights (for example, by analyzing the most frequently used keywords and hashtags). Furthermore, we look at how people's sentiments vary across the countries. Our results indicate that although the overall reaction to the vaccine is positive, there are a...
Biomedical Foundation Model: A Survey.
EN: Foundation models, first introduced in 2021, are large-scale pre-trained models (e.g., large language models (LLMs) and vision-language models (VLMs)) that learn from extensive unlabeled datasets through unsupervised methods, enabling them to excel in diverse downstream tasks. These models, like GPT, can be adapted to various applications such as question answering and visual understanding, outperforming task-specific AI models and earning their name due to broad applicability across fields. The development of biomedical foundation models marks a significant milestone in leveraging artificial intelligence (AI) to understand complex biological phenomena and advance medical research and practice. This survey explores the potential of foundation models across diverse domains within biomedical fields, including computational biology, drug discovery and development, clinical informatics, medical imaging, and public health. The purpose of this survey is to inspire ongoing research in the application of foundation models to health science.
Pushing the boundaries of Structure-Based Drug Design through Collaboration with Large Language Models.
EN: Structure-Based Drug Design (SBDD) has revolutionized drug discovery by enabling the rational design of molecules for specific protein targets. Despite significant advancements in improving docking scores, advanced 3D-SBDD generative models still face challenges in producing drug-like candidates that meet medicinal chemistry standards and pharmacokinetic requirements. These limitations arise from their inherent focus on molecular interactions, often neglecting critical aspects of drug-likeness. To address these shortcomings, we introduce the Collaborative Intelligence Drug Design (CIDD) framework, which combines the structural precision of 3D-SBDD models with the chemical reasoning capabilities of large language models (LLMs). CIDD begins by generating supporting molecules with 3D-SBDD models and then refines these molecules through LLM-supported modules to enhance drug-likeness and structural reasonability. When evaluated on the CrossDocked2020 dataset, CIDD achieved a remarkable success ratio of 37.94%, significantly outperforming the previous state-of-the-art benchmark of 15.72%. Although improving molecular interactions and drug-likeness is often seen as a trade-off, CIDD uniqu...
Molecule Generation for Target Protein Binding with Hierarchical Consistency Diffusion Model.
EN: Effective generation of molecular structures, or new chemical entities, that bind to target proteins is crucial for lead identification and optimization in drug discovery. Despite advancements in atom- and motif-wise deep learning models for 3D molecular generation, current methods often struggle with validity and reliability. To address these issues, we develop the Atom-Motif Consistency Diffusion Model (AMDiff), utilizing a joint-training paradigm for multi-view learning. This model features a hierarchical diffusion architecture that integrates both atom- and motif-level views of molecules, allowing for comprehensive exploration of complementary information. By leveraging classifier-free guidance and incorporating binding site features as conditional inputs, AMDiff ensures robust molecule generation across diverse targets. Compared to existing approaches, AMDiff exhibits superior validity and novelty in generating molecules tailored to fit various protein pockets. Case studies targeting protein kinases, including Anaplastic Lymphoma Kinase (ALK) and Cyclin-dependent kinase 4 (CDK4), demonstrate the model's capability in structure-based de novo drug design. Overall, AMDiff bridges...
Diagnostic Method for Hydropower Plant Condition-based Maintenance combining Autoencoder with Clustering Algorithms.
EN: The French company EDF uses supervisory control and data acquisition systems in conjunction with a data management platform to monitor hydropower plants, allowing engineers and technicians to analyse the time-series collected. Depending on the strategic importance of the monitored hydropower plant, the number of time-series collected can vary greatly, making it difficult to generate valuable information from the extracted data. To address this problem, a condition detection and diagnosis method combining clustering algorithms and autoencoder neural networks for pattern recognition has been developed and is presented in this paper. First, a dimension reduction algorithm is used to create a 2- or 3-dimensional projection that allows users to identify unsuspected relationships between datapoints. Then, a collection of clustering algorithms groups the datapoints into clusters. For each identified cluster, an autoencoder neural network is trained on the corresponding dataset. The aim is to measure the reconstruction error between each autoencoder model and the measured values, thus creating a proximity index for each state discovered during the c...
Auto-ADMET: An Effective and Interpretable AutoML Method for Chemical ADMET Property Prediction.
EN: Machine learning (ML) has played an important role in drug discovery in recent years by providing (pre-)screening tools for prioritising chemical compounds to pass through wet lab experiments. One of the main ML tasks in drug discovery is to build quantitative structure-activity relationship (QSAR) models, associating the molecular structure of chemical compounds with an activity or property. These properties -- including absorption, distribution, metabolism, excretion and toxicity (ADMET) -- are essential to model compound behaviour, activity and interactions in the organism. Although several methods exist, the majority of them do not provide appropriate model personalisation, leading to bias and a lack of generalisation to new data, since the chemical space usually shifts from application to application. This leads to low predictive performance when the model is tested on completely new data. The area of Automated Machine Learning (AutoML) emerged to solve this issue, outputting ML algorithms tailored to the data at hand. Although an important task, AutoML has not been practically used to assist cheminformatics and computational chemistry researchers of...
A Bayesian mixed-effects model to evaluate the determinants of COVID-19 vaccine uptake in the US.
EN: The COVID-19 pandemic has adversely affected US public health, resulting in over a hundred million cases and more than one million deaths. Vaccination is the key intervention against the COVID-19 pandemic. Multiple COVID-19 vaccines are now available for human use. However, a number of factors, including socio-demographic variables, impact the uptake of COVID-19 vaccines. In this study, we apply a Bayesian mixed-effects model to assess different socio-demographic and spatial factors that influence the acceptance of COVID-19 vaccines in the US. The fitted mixed-effects model provides the probabilistic inference about the vaccine acceptance determinants with uncertainty quantification.
Fast and Accurate Blind Flexible Docking.
EN: Molecular docking, which predicts the bound structures of small molecules (ligands) to their protein targets, plays a vital role in drug discovery. However, existing docking methods often face limitations: they either overlook crucial structural changes by assuming protein rigidity or suffer from low computational efficiency due to their reliance on generative models for structure sampling. To address these challenges, we propose FABFlex, a fast and accurate regression-based multi-task learning model designed for realistic blind flexible docking scenarios, where proteins exhibit flexibility and binding pocket sites are unknown (blind). Specifically, FABFlex's architecture comprises three specialized modules working in concert: (1) A pocket prediction module that identifies potential binding sites, addressing the challenges inherent in blind docking scenarios. (2) A ligand docking module that predicts the bound (holo) structures of ligands from their unbound (apo) states. (3) A pocket docking module that forecasts the holo structures of protein pockets from their apo conformations. Notably, FABFlex incorporates an iterative update mechanism that serves as a conduit between the ligand ...
Mol-LLaMA: Towards General Understanding of Molecules in Large Molecular Language Model.
EN: Understanding molecules is key to understanding organisms and driving advances in drug discovery, requiring interdisciplinary knowledge across chemistry and biology. Although large molecular language models have achieved notable success in task transfer, they often struggle to accurately analyze molecular features due to limited knowledge and reasoning capabilities. To address this issue, we present Mol-LLaMA, a large molecular language model that grasps the general knowledge centered on molecules and exhibits explainability and reasoning ability. To this end, we design key data types that encompass the fundamental molecular features, taking into account the essential abilities for molecular reasoning. Further, to improve molecular understanding, we propose a module that integrates complementary information from different molecular encoders, leveraging the distinct advantages of molecular representations. Our experimental results demonstrate that Mol-LLaMA is capable of comprehending the general features of molecules and providing informative responses, implying its potential as a general-purpose assistant for molecular analysis. Our project page is at https://mol-llama.github.io/.
Towards Quantum Tensor Decomposition in Biomedical Applications.
EN: Tensor decomposition has emerged as a powerful framework for feature extraction in multi-modal biomedical data. In this review, we present a comprehensive analysis of tensor decomposition methods such as Tucker, CANDECOMP/PARAFAC, spiked tensor decomposition, etc. and their diverse applications across biomedical domains such as imaging, multi-omics, and spatial transcriptomics. To systematically investigate the literature, we applied a topic modeling-based approach that identifies and groups distinct thematic sub-areas in biomedicine where tensor decomposition has been used, thereby revealing key trends and research directions. We evaluated challenges related to the scalability of latent spaces along with obtaining the optimal rank of the tensor, which often hinder the extraction of meaningful features from increasingly large and complex datasets. Additionally, we discuss recent advances in quantum algorithms for tensor decomposition, exploring how quantum computing can be leveraged to address these challenges. Our study includes a preliminary resource estimation analysis for quantum computing platforms and examines the feasibility of implementing quantum-enhanced tensor decomposit...
Beyond Cortisol! Physiological Indicators of Welfare for Dogs: Deficits, Misunderstandings and Opportunities.
EN: This paper aims to initiate new conversations about the use of physiological indicators when assessing the welfare of dogs. There are significant concerns about construct validity - whether the measures used accurately reflect welfare. The goal is to provide recommendations for future inquiry and encourage debate. We acknowledge that the scientific understanding of animal welfare has evolved and bring attention to the shortcomings of commonly used biomarkers like cortisol. These indicators are frequently used in isolation and with limited salient dog descriptors, so fail to reflect the canine experience adequately. Using a systems approach, we explore various physiological systems and alternative indicators, such as heart rate variability and oxidative stress, to address this limitation. It is essential to consider factors like age, body weight, breed, and sex when interpreting these biomarkers correctly, and researchers should report on these in their studies. This discussion identifies possible indicators for both positive and negative experiences. In conclusion, we advocate for a practical, evidence-based approach to assessing indicators of canine welfare, including non-invasive...
Water-in-water PEG/DEX/protein microgel emulsions: effect of microgel particle size on the rate of emulsion phase separation.
EN: Protein nanoparticles have been proven to be highly effective stabilizers of water-in-water emulsions obtained from a number of different types of aqueous two-phase systems (ATPS). The stabilizing efficiency of such particles is attributed to their affinity to the water/water interface of relevant ATPS, and emulsion formulations with long-term stability were reported in the recent years. In this study we investigated the macroscopic dynamics of the early-stage time evolution of dextran-in-polyethylene glycol emulsions obtained from a single ATPS and containing beta-lactoglobulin microgel particles of various diameters (ca. 40-190 nm). The results revealed the existence of a threshold in microgel size above which the water-in-water emulsion is stabilized, and that the process of segregative phase separation is determined by the interplay of droplets coalescence and sedimentation. Efficient droplet coalescence inhibition was found for microgel particles larger than 60 nm. Based on previous literature results, we discuss our coalescence-driven phase separation data in the context of the formation of durable particle layers on the emulsion droplets and the resulting droplet-droplet int...
Novel computational workflows for natural and biomedical image processing based on hypercomplex algebras.
EN: Hypercomplex image processing extends conventional techniques in a unified paradigm encompassing algebraic and geometric principles. This work leverages quaternions and the two-dimensional orthogonal planes split framework (splitting of a quaternion - representing a pixel - into pairs of orthogonal 2D planes) for natural/biomedical image analysis through the following computational workflows and outcomes: natural/biomedical image re-colorization, natural image de-colorization, natural/biomedical image contrast enhancement, computational re-staining and stain separation in histological images, and performance gains in machine/deep learning pipelines for histological images. The workflows are analyzed separately for natural and biomedical images to showcase the effectiveness of the proposed approaches. The proposed workflows can regulate color appearance (e.g. with alternative renditions and grayscale conversion) and image contrast, be part of automated image processing pipelines (e.g. isolating stain components, boosting learning models), and assist in digital pathology applications (e.g. enhancing biomarker visibility, enabling colorblind-friendly renditions). Employing only basic ...
Supervised contrastive learning for cell stage classification of animal embryos.
EN: Video microscopy, when combined with machine learning, offers a promising approach for studying the early development of in vitro produced (IVP) embryos. However, manually annotating developmental events, and more specifically cell divisions, is time-consuming for a biologist and cannot scale up for practical applications. We aim to automatically classify the cell stages of embryos from 2D time-lapse microscopy videos with a deep learning approach. We focus on the analysis of bovine embryonic development using video microscopy, as we are primarily interested in the application of cattle breeding, and we have created a Bovine Embryos Cell Stages (ECS) dataset. The challenges are three-fold: (1) low-quality images and bovine dark cells that make the identification of cell stages difficult, (2) class ambiguity at the boundaries of developmental stages, and (3) imbalanced data distribution. To address these challenges, we introduce CLEmbryo, a novel method that leverages supervised contrastive learning combined with focal loss for training, and the lightweight 3D neural network CSN-50 as an encoder. We also show that our method generalizes well. CLEmbryo outperforms state-of-the-art me...
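The focal loss mentioned above addresses class imbalance by down-weighting examples the model already classifies confidently. The sketch below implements the standard single-example form, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). The values gamma=2.0 and alpha=1.0 are assumed defaults, not hyperparameters reported in the abstract, and the supervised contrastive term of CLEmbryo is not shown.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example.

    probs:  predicted class probabilities (summing to 1)
    target: index of the true class
    gamma:  focusing parameter; gamma=0 recovers plain cross-entropy
    alpha:  scalar weight (assumed constant here for simplicity)
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0)  # confident and correct
hard = focal_loss([0.1, 0.9], target=0)  # confident and wrong
```

With gamma above zero, the easy example contributes far less than it would under cross-entropy, so training concentrates on hard or minority-class frames.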
Generating 3D Binding Molecules Using Shape-Conditioned Diffusion Models with Guidance.
EN: Drug development is a critical but notoriously resource- and time-consuming process. In this manuscript, we develop a novel generative artificial intelligence (genAI) method DiffSMol to facilitate drug development. DiffSMol generates 3D binding molecules based on the shapes of known ligands. DiffSMol encapsulates geometric details of ligand shapes within pre-trained, expressive shape embeddings and then generates new binding molecules through a diffusion model. DiffSMol further modifies the generated 3D structures iteratively via shape guidance to better resemble the ligand shapes. It also tailors the generated molecules toward optimal binding affinities under the guidance of protein pockets. Here, we show that DiffSMol outperforms the state-of-the-art methods on benchmark datasets. When generating binding molecules resembling ligand shapes, DiffSMol with shape guidance achieves a success rate of 61.4%, substantially outperforming the best baseline (11.2%), while producing molecules with novel molecular graph structures. DiffSMol with pocket guidance also outperforms the best baseline in binding affinities by 13.2%, and even by 17.7% when combined with shape guidance. Case studies...
LLMs for Drug-Drug Interaction Prediction: A Comprehensive Comparison.
EN: The increasing volume of drug combinations in modern therapeutic regimens needs reliable methods for predicting drug-drug interactions (DDIs). While Large Language Models (LLMs) have revolutionized various domains, their potential in pharmaceutical research, particularly in DDI prediction, remains largely unexplored. This study thoroughly investigates LLMs' capabilities in predicting DDIs by uniquely processing molecular structures (SMILES), target organisms, and gene interaction data as raw text input from the latest DrugBank dataset. We evaluated 18 different LLMs, including proprietary models (GPT-4, Claude, Gemini) and open-source variants (from 1.5B to 72B parameters), first assessing their zero-shot capabilities in DDI prediction. We then fine-tuned selected models (GPT-4, Phi-3.5 2.7B, Qwen-2.5 3B, Gemma-2 9B, and Deepseek R1 distilled Qwen 1.5B) to optimize their performance. Our comprehensive evaluation framework included validation across 13 external DDI datasets, comparing against traditional approaches such as l2-regularized logistic regression. Fine-tuned LLMs demonstrated superior performance, with Phi-3.5 2.7B achieving a sensitivity of 0.978 in DDI prediction, with ...
Temporal Distribution Shift in Real-World Pharmaceutical Data: Implications for Uncertainty Quantification in QSAR Models.
EN: The estimation of uncertainties associated with predictions from quantitative structure-activity relationship (QSAR) models can accelerate the drug discovery process by identifying promising experiments and allowing an efficient allocation of resources. Several computational tools exist that estimate the predictive uncertainty in machine learning models. However, deviations from the i.i.d. setting have been shown to impair the performance of these uncertainty quantification methods. We use a real-world pharmaceutical dataset to address the pressing need for a comprehensive, large-scale evaluation of uncertainty estimation methods in the context of realistic distribution shifts over time. We investigate the performance of several uncertainty estimation methods, including ensemble-based and Bayesian approaches. Furthermore, we use this real-world setting to systematically assess the distribution shifts in label and descriptor space and their impact on the capability of the uncertainty estimation methods. Our study reveals significant shifts over time in both label and descriptor space and a clear connection between the magnitude of the shift and the nature of the assay. Moreover, we ...
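As a rough illustration of the ensemble-based uncertainty estimation named above, the sketch below treats the spread of predictions across ensemble members as the uncertainty. The function `ensemble_predict` and the three toy linear "models" are hypothetical stand-ins; the abstract does not specify the paper's ensembles, descriptors, or assays.

```python
from statistics import mean, pstdev


def ensemble_predict(models, x):
    """Return (mean prediction, standard deviation across ensemble members)."""
    predictions = [model(x) for model in models]
    return mean(predictions), pstdev(predictions)


# Hypothetical ensemble: three regressors that agree near x = 0 and
# diverge as the input moves away from it.
models = [
    lambda x: 1.0 * x,
    lambda x: 1.1 * x,
    lambda x: 0.9 * x,
]

mu_near, sigma_near = ensemble_predict(models, 1.0)
mu_far, sigma_far = ensemble_predict(models, 10.0)
```

The disagreement grows for the input farther from the region where the members agree, which is the behaviour one hopes an uncertainty estimate shows under distribution shift.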
Deep Learning-Based Approach for Identification of Potato Leaf Diseases Using Wrapper Feature Selection and Feature Concatenation.
EN: The potato is a widely grown crop in many regions of the world, and potato farming has expanded considerably in recent decades. Potatoes are susceptible to several diseases that stunt their development, and leaf diseases are especially significant. Early Blight and Late Blight are two prevalent leaf diseases that affect potato plants. Early detection of these diseases would help improve crop yield. A suitable solution is to use image processing to identify and analyze these diseases. Here, we present an autonomous method based on image processing and machine learning to detect late blight disease affecting potato leaves. The proposed method comprises four phases: (1) Histogram Equalization is used to improve the quality of the input image; (2) feature extraction is performed using a Deep CNN model, then these extracted features are concatenated; (3) feature selection is performed using wrapper-based feature selection; (4) classification is performed using an SVM classifier and its variants. This proposed method achieves the highest accuracy of 99% using SVM by selecting 550 features.
JingFang: An Expert-Level Large Language Model for Traditional Chinese Medicine Clinical Consultation and Syndrome Differentiation-Based Treatment.
EN: The effective application of traditional Chinese medicine (TCM) requires extensive knowledge of TCM and clinical experience. The emergence of Large Language Models (LLMs) offers a way to address this, but existing LLMs for TCM exhibit critical limitations: incomplete clinical consultation and diagnoses, and inaccurate syndrome differentiation. To address these issues, we establish JingFang (JF), a novel TCM LLM that demonstrates expert-level ability in clinical consultation and syndrome differentiation. We propose a Multi-Agent Collaborative Chain-of-Thought Mechanism (MACCTM) for comprehensive and targeted clinical consultation, giving JF effective and accurate diagnostic ability. In addition, a Syndrome Agent and a Dual-Stage Recovery Scheme (DSRS) are developed to improve the accuracy of syndrome differentiation and the corresponding treatment. JingFang not only facilitates the application of LLMs but also promotes the effective application of TCM for healthcare.
Navigating the Fragrance space Via Graph Generative Models And Predicting Odors.
EN: We explore a suite of generative modelling techniques to efficiently navigate and explore the complex landscapes of odor and the broader chemical space. Unlike traditional approaches, we not only generate molecules but also predict the odor likeliness with ROC AUC score of 0.97 and assign probable odor labels. We correlate odor likeliness with physicochemical features of molecules using machine learning techniques and leverage SHAP (SHapley Additive exPlanations) to demonstrate the interpretability of the function. The whole process involves four key stages: molecule generation, stringent sanitization checks for molecular validity, fragrance likeliness screening and odor prediction of the generated molecules. By making our code and trained models publicly accessible, we aim to facilitate broader adoption of our research across applications in fragrance discovery and olfactory research.
An SIRS-model considering waning efficiency and periodic re-vaccination.
EN: In this paper, we extend the classical SIRS (Susceptible-Infectious-Recovered-Susceptible) model from mathematical epidemiology by incorporating a vaccinated compartment, V, accounting for an imperfect vaccine with waning efficacy over time. The SIRSV-model divides the population into four compartments and introduces periodic re-vaccination for waning immunity. The efficiency of the vaccine is assumed to decay with the time passed since the vaccination. Periodic re-vaccinations are applied to the population. We develop a partial differential equation (PDE) model for the continuous vaccination time and a coupled ordinary differential equation (ODE) system when discretizing the vaccination period. We analyze the equilibria of the ODE model and investigate the linear stability of the disease-free equilibrium (DFE). Furthermore, we explore an optimization framework where vaccination rate, re-vaccination time, and non-pharmaceutical interventions (NPIs) are control variables to minimize infection levels. The optimization objective is defined using different norm-based measures of infected individuals. A numerical analysis of the model's dynamic behavior under varying control parameters ...
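The compartment structure described in this abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes a constant vaccine efficacy and a constant waning rate instead of efficacy that decays with time since vaccination, it uses forward-Euler integration, and every parameter name and value (beta, gamma, omega, nu, waning, efficacy) is hypothetical.

```python
def simulate_sirsv(beta=0.3, gamma=0.1, omega=0.01, nu=0.02,
                   waning=0.005, efficacy=0.8, i0=0.01,
                   days=200.0, dt=0.1):
    """Forward-Euler integration of an illustrative SIRS model with a
    vaccinated compartment V. All rates are hypothetical assumptions."""
    # Population fractions: susceptible, infectious, recovered, vaccinated.
    s, i, r, v = 1.0 - i0, i0, 0.0, 0.0
    for _ in range(round(days / dt)):
        new_inf_s = beta * s * i                     # infections among susceptibles
        new_inf_v = (1.0 - efficacy) * beta * v * i  # breakthrough infections (imperfect vaccine)
        ds = -new_inf_s - nu * s + omega * r + waning * v
        di = new_inf_s + new_inf_v - gamma * i
        dr = gamma * i - omega * r
        dv = nu * s - new_inf_v - waning * v
        s, i, r, v = s + dt * ds, i + dt * di, r + dt * dr, v + dt * dv
    return s, i, r, v


final_state = simulate_sirsv()
print("S, I, R, V after 200 days:", final_state)
```

The four derivatives sum to zero, so the total population fraction is conserved up to floating-point error. Modelling time-since-vaccination explicitly, as the abstract describes, would need either the PDE formulation or several vaccinated sub-compartments.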
Group Ligands Docking to Protein Pockets.
EN: Molecular docking is a key task in computational biology that has attracted increasing interest from the machine learning community. While existing methods have achieved success, they generally treat each protein-ligand pair in isolation. Inspired by the biochemical observation that ligands binding to the same target protein tend to adopt similar poses, we propose GroupBind, a novel molecular docking framework that simultaneously considers multiple ligands docking to a protein. This is achieved by introducing an interaction layer for the group of ligands and a triangle attention module for embedding protein-ligand and group-ligand pairs. By integrating our approach with a diffusion-based docking model, we set a new state-of-the-art performance on the PDBBind blind docking benchmark, demonstrating the effectiveness of our proposed molecular docking paradigm.
Procedural Generation of 3D Maize Plant Architecture from LIDAR Data.
EN: This study introduces a robust framework for generating procedural 3D models of maize (Zea mays) plants from LiDAR point cloud data, offering a scalable alternative to traditional field-based phenotyping. Our framework leverages Non-Uniform Rational B-Spline (NURBS) surfaces to model the leaves of maize plants, combining Particle Swarm Optimization (PSO) for an initial approximation of the surface and a differentiable programming framework for precise refinement of the surface to fit the point cloud data. In the first optimization phase, PSO generates an approximate NURBS surface by optimizing its control points, aligning the surface with the LiDAR data, and providing a reliable starting point for refinement. The second phase uses NURBS-Diff, a differentiable programming framework, to enhance the accuracy of the initial fit by refining the surface geometry and capturing intricate leaf details. Our results demonstrate that, while PSO establishes a robust initial fit, the integration of differentiable NURBS significantly improves the overall quality and fidelity of the reconstructed surface. This hierarchical optimization strategy enables accurate 3D reconstruction of maize leaves ac...
A Survey on Memory-Efficient Transformer-Based Model Training in AI for Science.
EN: Scientific research faces high costs and inefficiencies with traditional methods, but the rise of deep learning and large language models (LLMs) offers innovative solutions. This survey reviews transformer-based LLM applications across scientific fields such as biology, medicine, chemistry, and meteorology, underscoring their role in advancing research. However, the continuous expansion of model size has led to significant memory demands, hindering further development and application of LLMs for science. This survey systematically reviews and categorizes memory-efficient pre-training techniques for large-scale transformers, including algorithm-level, system-level, and hardware-software co-optimization. Using AlphaFold 2 as an example, we demonstrate how tailored memory optimization methods can reduce storage needs while preserving prediction accuracy. By bridging model efficiency and scientific application needs, we hope to provide insights for scalable and cost-effective LLM training in AI for science.
VECT-GAN: A variationally encoded generative model for overcoming data scarcity in pharmaceutical science.
EN: Data scarcity in pharmaceutical research has led to reliance on labour-intensive trial-and-error approaches for development rather than data-driven methods. While Machine Learning offers a solution, existing datasets are often small and noisy, limiting their utility. To address this, we developed a Variationally Encoded Conditional Tabular Generative Adversarial Network (VECT-GAN), a novel generative model specifically designed for augmenting small, noisy datasets. We introduce a pipeline where data is augmented before regression model development and demonstrate that this consistently and significantly improves performance over other state-of-the-art tabular generative models. We apply this pipeline across six pharmaceutical datasets, and highlight its real-world applicability by developing novel polymers with medically desirable mucoadhesive properties, which we made and experimentally characterised. Additionally, we pre-train the model on the ChEMBL database of drug-like molecules, leveraging knowledge distillation to enhance its generalisability, making it readily available for use on pharmaceutical datasets containing small molecules, an extremely common pharmaceutical task. W...
Early prediction of the transferability of bovine embryos from videomicroscopy.
EN: Videomicroscopy is a promising tool combined with machine learning for studying the early development of in vitro fertilized bovine embryos and assessing its transferability as soon as possible. We aim to predict the embryo transferability within four days at most, taking 2D time-lapse microscopy videos as input. We formulate this problem as a supervised binary classification problem for the classes transferable and not transferable. The challenges are three-fold: 1) poorly discriminating appearance and motion, 2) class ambiguity, 3) small amount of annotated data. We propose a 3D convolutional neural network involving three pathways, which makes it multi-scale in time and able to handle appearance and motion in different ways. For training, we retain the focal loss. Our model, named SFR, compares favorably to other methods. Experiments demonstrate its effectiveness and accuracy for our challenging biological task.
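The abstract mentions training with the focal loss, which down-weights easy examples when classes are imbalanced or ambiguous. A minimal sketch of the standard binary form, FL = -alpha_t (1 - p_t)^gamma log(p_t), is shown here for one prediction. The default gamma and alpha values are common choices and are assumptions, not settings reported for SFR.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one predicted probability p of the positive class
    and one label y in {0, 1}. gamma and alpha defaults are assumed values."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, eps), 1.0)               # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than a confident mistake.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
print(easy, hard)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy.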
D3MES: Diffusion Transformer with multihead equivariant self-attention for 3D molecule generation.
EN: Understanding and predicting the diverse conformational states of molecules is crucial for advancing fields such as chemistry, material science, and drug development. Despite significant progress in generative models, accurately generating complex and biologically or material-relevant molecular structures remains a major challenge. In this work, we introduce a diffusion model for three-dimensional (3D) molecule generation that combines a classifiable diffusion model, Diffusion Transformer, with multihead equivariant self-attention. This method addresses two key challenges: correctly attaching hydrogen atoms in generated molecules through learning representations of molecules after hydrogen atoms are removed; and overcoming the limitations of existing models that cannot generate molecules across multiple classes simultaneously. The experimental results demonstrate that our model not only achieves state-of-the-art performance across several key metrics but also exhibits robustness and versatility, making it highly suitable for early-stage large-scale generation processes in molecular design, followed by validation and further screening to obtain molecules with specific properties.
Halal or Not: Knowledge Graph Completion for Predicting Cultural Appropriateness of Daily Products.
EN: The growing demand for halal cosmetic products has exposed significant challenges, especially in Muslim-majority countries. Recently, various machine learning-based strategies, e.g., image-based methods, have shown remarkable success in predicting the halal status of cosmetics. However, these methods mainly focus on analyzing the discrete and specific ingredients within separate cosmetics, which ignore the high-order and complex relations between cosmetics and ingredients. To address this problem, we propose a halal cosmetic recommendation framework, namely HaCKG, that leverages a knowledge graph of cosmetics and their ingredients to explicitly model and capture the relationships between cosmetics and their components. By representing cosmetics and ingredients as entities within the knowledge graph, HaCKG effectively learns the high-order and complex relations between entities, offering a robust method for predicting halal status. Specifically, we first construct a cosmetic knowledge graph representing the relations between various cosmetics, ingredients, and their properties. We then propose a pre-trained relational graph attention network model with residual connections to learn ...
Multimodal Contrastive Representation Learning in Augmented Biomedical Knowledge Graphs.
EN: Biomedical Knowledge Graphs (BKGs) integrate diverse datasets to elucidate complex relationships within the biomedical field. Effective link prediction on these graphs can uncover valuable connections, such as potential novel drug-disease relations. We introduce a novel multimodal approach that unifies embeddings from specialized Language Models (LMs) with Graph Contrastive Learning (GCL) to enhance intra-entity relationships while employing a Knowledge Graph Embedding (KGE) model to capture inter-entity relationships for effective link prediction. To address limitations in existing BKGs, we present PrimeKG++, an enriched knowledge graph incorporating multimodal data, including biological sequences and textual descriptions for each entity type. By combining semantic and relational information in a unified representation, our approach demonstrates strong generalizability, enabling accurate link predictions even for unseen nodes. Experimental results on PrimeKG++ and the DrugBank drug-target interaction dataset demonstrate the effectiveness and robustness of our method across diverse biomedical datasets. Our source code, pre-trained models, and data are publicly available at https://...
A Study on Context Length and Efficient Transformers for Biomedical Image Analysis.
EN: Biomedical imaging modalities often produce high-resolution, multi-dimensional images that pose computational challenges for deep neural networks. These computational challenges are compounded when training transformers due to the self-attention operator, which scales quadratically with context length. Recent developments in long-context models have potential to alleviate these difficulties and enable more efficient application of transformers to large biomedical images, although a systematic evaluation on this topic is lacking. In this study, we investigate the impact of context length on biomedical image analysis and we evaluate the performance of recently proposed long-context models. We first curate a suite of biomedical imaging datasets, including 2D and 3D data for segmentation, denoising, and classification tasks. We then analyze the impact of context length on network performance using the Vision Transformer and Swin Transformer by varying patch size and attention window size. Our findings reveal a strong relationship between context length and performance, particularly for pixel-level prediction tasks. Finally, we show that recent long-context models demonstrate significan...
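To make the quadratic scaling mentioned in this abstract concrete, the sketch below counts the tokens a patch-based transformer would see for a square 2D image and the number of pairwise attention scores per head. The image and patch sizes are illustrative assumptions, not values taken from the study.

```python
def num_tokens(image_size, patch_size):
    """Number of non-overlapping patches (tokens) for a 2D image."""
    height, width = image_size
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_pairs(n_tokens):
    """Entries in one full self-attention score matrix (per head, per layer)."""
    return n_tokens * n_tokens


for patch in (32, 16, 8):
    n = num_tokens((224, 224), patch)
    print(f"patch={patch:2d}  tokens={n:4d}  attention entries={attention_pairs(n)}")
```

Halving the patch size quadruples the token count and multiplies the attention matrix size by sixteen, which is why context length becomes the bottleneck for pixel-level tasks on large images.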
Analyzing Country-Level Vaccination Rates and Determinants of Practical Capacity to Administer COVID-19 Vaccines.
EN: The COVID-19 vaccine development, manufacturing, transportation, and administration proved an extreme logistics operation of global magnitude. Global vaccination levels, however, remain a key concern in preventing the emergence of new strains and minimizing the impact of the pandemic's disruption of daily life. In this paper, country-level vaccination rates are analyzed through a queuing framework to extract service rates that represent the practical capacity of a country to administer vaccines. These rates are further characterized through regression and interpretable machine learning methods with country-level demographic, governmental, and socio-economic variates. Model results show that participation in multi-governmental collaborations such as COVAX may improve the ability to vaccinate. Similarly, improved transportation and accessibility variates such as roads per area for low-income countries and rail lines per area for high-income countries can improve rates. It was also found that for low-income countries specifically, improvements in basic and health infrastructure (as measured through spending on healthcare, number of doctors and hospital beds per 100k, population percen...
Old vaccines, new usages, surprisingly effective in solving the century-old problem -Inactivated African Swine Fever Virus vaccine induces safe and efficient immune protection through mucosal immunity.
EN: Background: African swine fever is among the most devastating viral diseases of pigs. Despite nearly a century of research, there is still no safe and effective vaccine available. The current situation is that either vaccines are safe but not effective, or they are effective but not safe. Findings: The ASF vaccine prepared using the inactivation method with propiolactone provided 98.6% protection within 100 days after three intranasal immunizations, spaced 7 days apart. Conclusions: An inactivated vaccine made from complete African swine fever virus particles using propiolactone is safe and effective for controlling ASF through mucosal immunity.
FairDD: Enhancing Fairness with domain-incremental learning in dermatological disease diagnosis.
EN: With the rapid advancement of deep learning technologies, artificial intelligence has become increasingly prevalent in the research and application of dermatological disease diagnosis. However, this data-driven approach often faces issues related to decision bias. Existing fairness enhancement techniques typically come at a substantial cost to accuracy. This study aims to achieve a better trade-off between accuracy and fairness in dermatological diagnostic models. To this end, we propose a novel fair dermatological diagnosis network, named FairDD, which leverages domain incremental learning to balance the learning of different groups by being sensitive to changes in data distribution. Additionally, we incorporate the mixup data augmentation technique and supervised contrastive learning to enhance the network's robustness and generalization. Experimental validation on two dermatological datasets demonstrates that our proposed method excels in both fairness criteria and the trade-off between fairness and performance.
Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm.
EN: In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design. Numerical feature representation of SMILES strings is a challenging task. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The proposed approach involves computing a kernel matrix using the Sinkhorn-Knopp algorithm while using kernel principal component analysis (PCA) for dimensionality reduction. The resulting low-dimensional embeddings are then used for classification and regression analysis. The kernel matrix is computed by converting the SMILES strings into molecular structures using the Morgan Fingerprint, which computes a fingerprint for each molecule. The distance matrix is computed using the pairwise kernels function. The Sinkhorn-Knopp algorithm is used to compute the final kernel matrix that satisfies the constraints of a probability distribution. This is achieved by iteratively adjusting the kernel matrix until the marginal distributions of the rows and columns match the desired marginal distributions. We provided a comprehensive empirical analysis of the proposed kernel m...
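The Sinkhorn-Knopp step described in this abstract, rescaling a kernel matrix until its rows and columns match target marginals, can be sketched with NumPy alone. The example assumes uniform marginals (every row and column sums to 1) and uses a toy kernel built from random points in place of fingerprint-derived similarities. It is an illustration of the general algorithm, not the authors' pipeline.

```python
import numpy as np


def sinkhorn_knopp(K, n_iter=1000, tol=1e-9):
    """Scale a strictly positive square matrix K to be (approximately)
    doubly stochastic by alternating column and row rescaling."""
    K = np.asarray(K, dtype=float)
    if K.ndim != 2 or K.shape[0] != K.shape[1] or (K <= 0).any():
        raise ValueError("K must be a square matrix with strictly positive entries")
    r = np.ones(K.shape[0])
    for _ in range(n_iter):
        c = 1.0 / (K.T @ r)   # rescale columns given the current row factors
        r = 1.0 / (K @ c)     # rescale rows given the new column factors
        P = r[:, None] * K * c[None, :]
        # Rows sum to 1 by construction here, so only columns need checking.
        if np.abs(P.sum(axis=0) - 1.0).max() < tol:
            break
    return P


# Toy stand-in for a molecular kernel: Gaussian kernel on random feature vectors.
rng = np.random.default_rng(0)
X = rng.random((5, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists)
P = sinkhorn_knopp(K)
print(P.sum(axis=0), P.sum(axis=1))
```

In a real setting the matrix K would come from pairwise similarities between molecular fingerprints, and the normalized matrix could then be passed to kernel PCA.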
Canine EEG Helps Human: Cross-Species and Cross-Modality Epileptic Seizure Detection via Multi-Space Alignment.
EN: Epilepsy significantly impacts global health, affecting about 65 million people worldwide, along with various animal species. The diagnostic processes of epilepsy are often hindered by the transient and unpredictable nature of seizures. Here we propose a multi-space alignment approach based on cross-species and cross-modality electroencephalogram (EEG) data to enhance the detection capabilities and understanding of epileptic seizures. By employing deep learning techniques, including domain adaptation and knowledge distillation, our framework aligns cross-species and cross-modality EEG signals to enhance the detection capability beyond traditional within-species and within-modality models. Experiments on multiple surface and intracranial EEG datasets of humans and canines demonstrated substantial improvements in the detection accuracy, achieving over 90% AUC scores for cross-species and cross-modality seizure detection with extremely limited labeled data from the target species/modality. To our knowledge, this is the first study that demonstrates the effectiveness of integrating heterogeneous data from different species and modalities to improve EEG-based seizure detection performance...
Decoding Drug Discovery: Exploring A-to-Z In silico Methods for Beginners.
EN: The drug development process is a critical challenge in the pharmaceutical industry due to its time-consuming nature and the need to discover new drug potentials to address various ailments. The initial step in drug development, drug target identification, often consumes considerable time. While valid, traditional methods such as in vivo and in vitro approaches are limited in their ability to analyze vast amounts of data efficiently, leading to wasteful outcomes. To expedite and streamline drug development, an increasing reliance on computer-aided drug design (CADD) approaches has emerged. These sophisticated in silico methods offer a promising avenue for efficiently identifying viable drug candidates, thus providing pharmaceutical firms with significant opportunities to uncover new prospective drug targets. The main goal of this work is to review in silico methods used in the drug development process with a focus on identifying therapeutic targets linked to specific diseases at the genetic or protein level. This article thoroughly discusses A-to-Z in silico techniques, which are essential for identifying the targets of bioactive compounds and their potential therapeutic effects. Th...
FlowDock: Geometric Flow Matching for Generative Protein-Ligand Docking and Affinity Prediction.
EN: Powerful generative AI models of protein-ligand structure have recently been proposed, but few of these methods support both flexible protein-ligand docking and affinity estimation. Of those that do, none can directly model multiple binding ligands concurrently or have been rigorously benchmarked on pharmacologically relevant drug targets, hindering their widespread adoption in drug discovery efforts. In this work, we propose FlowDock, the first deep geometric generative model based on conditional flow matching that learns to directly map unbound (apo) structures to their bound (holo) counterparts for an arbitrary number of binding ligands. Furthermore, FlowDock provides predicted structural confidence scores and binding affinity values with each of its generated protein-ligand complex structures, enabling fast virtual screening of new (multi-ligand) drug targets. For the well-known PoseBusters Benchmark dataset, FlowDock outperforms single-sequence AlphaFold 3 with a 51% blind docking success rate using unbound (apo) protein input structures and without any information derived from multiple sequence alignments, and for the challenging new DockGen-E dataset, FlowDock outperforms si...
DUET: Dual Clustering Enhanced Multivariate Time Series Forecasting.
EN: Multivariate time series forecasting is crucial for various applications, such as financial investment, energy management, weather forecasting, and traffic optimization. However, accurate forecasting is challenging due to two main factors. First, real-world time series often show heterogeneous temporal patterns caused by distribution shifts over time. Second, correlations among channels are complex and intertwined, making it hard to model the interactions among channels precisely and flexibly. In this study, we address these challenges by proposing a general framework called DUET, which introduces dual clustering on the temporal and channel dimensions to enhance multivariate time series forecasting. First, we design a Temporal Clustering Module (TCM) that clusters time series into fine-grained distributions to handle heterogeneous temporal patterns. For different distribution clusters, we design various pattern extractors to capture their intrinsic temporal patterns, thus modeling the heterogeneity. Second, we introduce a novel Channel-Soft-Clustering strategy and design a Channel Clustering Module (CCM), which captures the relationships among channels in the frequency domain thr...
Decoding Poultry Vocalizations -- Natural Language Processing and Transformer Models for Semantic and Emotional Analysis.
EN: Deciphering the acoustic language of chickens offers new opportunities in animal welfare and ecological informatics. Their subtle vocal signals encode health conditions, emotional states, and dynamic interactions within ecosystems. Understanding the semantics of these calls provides a valuable tool for interpreting their functional vocabulary and clarifying how each sound serves a specific purpose in social and environmental contexts. We apply advanced Natural Language Processing and transformer-based models to translate bioacoustic data into meaningful insights. Our method integrates wav2vec 2.0 for raw audio feature extraction with a fine-tuned Bidirectional Encoder Representations from Transformers model, pretrained on a broad corpus of animal sounds and adapted to poultry tasks. This pipeline decodes poultry vocalizations into interpretable categories including distress calls, feeding signals, and mating vocalizations, revealing emotional nuances often overlooked by conventional analyses. Achieving 92 percent accuracy in classifying key vocalization types, our approach demonstrates the feasibility of real-time automated monitoring of flock health and stress. By tracking this f...
In Silico Pharmacokinetic and Molecular Docking Studies of Natural Plants against Essential Protein KRAS for Treatment of Pancreatic Cancer.
EN: Pancreatic Ductal Adenocarcinoma (PDAC), a type of pancreatic cancer, is considered one of the main causes of mortality in recent years. Evidence from several studies supports the concept that the oncogenic KRAS (Ki-ras2 Kirsten rat sarcoma viral oncogene) mutation is the major cause of pancreatic cancer. KRAS acts as an on-off switch that promotes cell growth. When the KRAS gene is mutated, however, the switch is stuck in one position, allowing cells to grow uncontrollably. This uncontrolled multiplication of cells causes cancer growth. Therefore, KRAS was selected as the target protein in this study. Fifty plant-derived compounds were selected for the study. Molecular docking was performed to determine whether the examined compounds could bind to the binding pocket of the KRAS complex. Computational analyses were used to assess the ability of the tested substances to cross the Blood Brain Barrier (BBB). To predict the bioactivity of the ligands, five machine learning models were created, and the best one was chosen for analyzing the bioactivity of each ligand. From the fifty plant-derived compounds the compounds with the least binding...
KITE-DDI: A Knowledge graph Integrated Transformer Model for accurately predicting Drug-Drug Interaction Events from Drug SMILES and Biomedical Knowledge Graph.
EN: It is a common practice in modern medicine to prescribe multiple medications simultaneously to treat diseases. However, these medications could have adverse reactions between them, known as Drug-Drug Interactions (DDI), which have the potential to cause significant bodily injury and could even be fatal. Hence, it is essential to identify all the DDI events before prescribing multiple drugs to a patient. Most contemporary research for predicting DDI events relies on either information from Biomedical Knowledge Graphs (KG) or drug SMILES, with very few managing to merge data from both to make predictions, while others use heuristic algorithms to extract features from SMILES and KGs, which are then fed into a Deep Learning framework to generate output. In this study, we propose a KG-integrated Transformer architecture to generate an end-to-end fully automated Machine Learning pipeline for predicting DDI events with high accuracy. The algorithm takes full-scale molecular SMILES sequences of a pair of drugs and a biomedical KG as input and predicts the interaction between the two drugs with high precision. The results show superior performance in two different benchmark datasets compare...
Power Plant Detection for Energy Estimation using GIS with Remote Sensing, CNN & Vision Transformers.
EN: In this research, we propose a hybrid model for power plant detection to assist energy estimation applications, by pipelining GIS (Geographical Information Systems) having Remote Sensing capabilities with CNN (Convolutional Neural Networks) and ViT (Vision Transformers). Our proposed approach enables real-time analysis with multiple data types on a common map via the GIS, entails feature-extraction abilities due to the CNN, and captures long-range dependencies through the ViT. This hybrid approach is found to enhance classification, thus helping in the monitoring and operational management of power plants; hence assisting energy estimation and sustainable energy planning in the future. It exemplifies adequate deployment of machine learning methods in conjunction with domain-specific approaches to enhance performance.
Deep-Learning Based Docking Methods: Fair Comparisons to Conventional Docking Workflows.
EN: The diffusion learning method, DiffDock, for docking small-molecule ligands into protein binding sites was recently introduced. Results included comparisons to more conventional docking approaches, with DiffDock showing superior performance. Here, we employ a fully automatic workflow using the Surflex-Dock methods to generate a fair baseline for conventional docking approaches. Results were generated for the common and expected situation where a binding site location is known and also for the condition of an unknown binding site. For the known binding site condition, Surflex-Dock success rates at 2.0 Angstroms RMSD far exceeded those for DiffDock (Top-1/Top-5 success rates, respectively, were 68/81% compared with 45/51%). Glide performed with similar success rates (67/73%) to Surflex-Dock for the known binding site condition, and results for AutoDock Vina and Gnina followed this pattern. For the unknown binding site condition, using an automated method to identify multiple binding pockets, Surflex-Dock success rates again exceeded those of DiffDock, but by a somewhat lesser margin. DiffDock made use of roughly 17,000 co-crystal structures for learning (98% of PDBBind version 2020, ...
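The Top-1/Top-5 success rates quoted above can be illustrated with a minimal sketch. It assumes that a complex counts as a success when at least one of its first k ranked poses lies within 2.0 Angstroms RMSD of the reference pose; the function name and the toy RMSD values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_success_rate(
    rmsds_per_complex: Sequence[Sequence[float]],
    k: int,
    threshold: float = 2.0,
) -> float:
    """Fraction of complexes with at least one of the first k ranked
    poses at or below `threshold` Angstroms RMSD.

    Each inner sequence holds pose RMSDs in the docking tool's own
    ranking order (best-scored pose first). Using <= for the cutoff is
    an assumption; some evaluations use a strict inequality.
    """
    if not rmsds_per_complex:
        raise ValueError("need at least one complex")
    hits = sum(
        1
        for poses in rmsds_per_complex
        if any(rmsd <= threshold for rmsd in poses[:k])
    )
    return hits / len(rmsds_per_complex)


# Toy data (invented for illustration): four complexes, three poses each.
example = [
    [1.2, 3.4, 5.0],  # best-ranked pose is already within 2.0
    [4.1, 1.8, 6.2],  # only the second-ranked pose is within 2.0
    [7.5, 6.9, 3.3],  # no pose within 2.0
    [0.9, 0.8, 2.5],  # best-ranked pose is within 2.0
]
top1 = top_k_success_rate(example, k=1)
top5 = top_k_success_rate(example, k=5)
```

On this toy data the Top-1 rate is 2 of 4 complexes and the Top-5 rate is 3 of 4, which shows why Top-5 can never be lower than Top-1 for the same ranking.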
Automating grapevine LAI features estimation with UAV imagery and machine learning.
EN: The leaf area index determines crop health and growth. Traditional methods for calculating it are time-consuming, destructive, costly, and limited in scale. In this study, we automate the index estimation method using drone image data of grapevine plants and a machine learning model. Traditional feature extraction and deep learning methods are used to obtain helpful information from the data and enhance the performance of the different machine learning models employed for the leaf area index prediction. The results showed that deep learning based feature extraction is more effective than traditional methods. The new approach is a significant improvement over old methods, offering a faster, non-destructive, and cost-effective leaf area index calculation, which enhances precision agriculture practices.
Enhancing Molecular Design through Graph-based Topological Reinforcement Learning.
EN: The generation of drug-like molecules is crucial for drug design. Existing reinforcement learning (RL) methods often overlook structural information. However, feature engineering-based methods usually merely focus on binding affinity prediction without substantial molecular modification. To address this, we present Graph-based Topological Reinforcement Learning (GraphTRL), which integrates both chemical and structural data for improved molecular generation. GraphTRL leverages multiscale weighted colored graphs (MWCG) and persistent homology, combined with molecular fingerprints, as the state space for RL. Evaluations show that GraphTRL outperforms existing methods in binding affinity prediction, offering a promising approach to accelerate drug discovery.
Assessing data-driven predictions of band gap and electrical conductivity for transparent conducting materials.
EN: Machine Learning (ML) has offered innovative perspectives for accelerating the discovery of new functional materials, leveraging the increasing availability of material databases. Despite the promising advances, data-driven methods face constraints imposed by the quantity and quality of available data. Moreover, ML is often employed in tandem with simulated datasets originating from density functional theory (DFT), and assessed through in-sample evaluation schemes. This scenario raises questions about the practical utility of ML in uncovering new and significant material classes for industrial applications. Here, we propose a data-driven framework aimed at accelerating the discovery of new transparent conducting materials (TCMs), an important category of semiconductors with a wide range of applications. To mitigate the shortage of available data, we create and validate unique experimental databases, comprising several examples of existing TCMs. We assess state-of-the-art (SOTA) ML models for property prediction from the stoichiometry alone. We propose a bespoke evaluation scheme to provide empirical evidence on the ability of ML to uncover new, previously unseen materials of intere...
A Multimodal Approach to The Detection and Classification of Skin Diseases.
EN: According to PBS, nearly one-third of Americans lack access to primary care services, and another forty percent delay going to avoid medical costs. As a result, many diseases are left undiagnosed and untreated, even if the disease shows many physical symptoms on the skin. With the rise of AI, self-diagnosis and improved disease recognition have become more promising than ever; in spite of that, existing methods suffer from a lack of large-scale patient databases and outdated methods of study, resulting in studies being limited to only a few diseases or modalities. This study incorporates readily available and easily accessible patient information via image and text for skin disease classification on a new dataset of 26 skin disease types that includes both skin disease images (37K) and associated patient narratives. Using this dataset, baselines for various image models were established that outperform existing methods. Initially, the Resnet-50 model was only able to achieve an accuracy of 70% but, after various optimization techniques, the accuracy was improved to 80%. In addition, this study proposes a novel fine-tuning strategy for sequence classification Large Language Models (...
GNNAS-Dock: Budget Aware Algorithm Selection with Graph Neural Networks for Molecular Docking.
EN: Molecular docking is a major element in drug discovery and design. It enables the prediction of ligand-protein interactions by simulating the binding of small molecules to proteins. Despite the availability of numerous docking algorithms, no single algorithm consistently outperforms the others across a diverse set of docking scenarios. This paper introduces GNNAS-Dock, a novel Graph Neural Network (GNN)-based automated algorithm selection system for molecular docking in blind docking situations. GNNs are adapted to process the complex structural data of both ligands and proteins. They benefit from the inherent graph-like properties to predict the performance of various docking algorithms under different conditions. The present study pursues two main objectives: 1) predict the performance of each candidate docking algorithm, in terms of Root Mean Square Deviation (RMSD), thereby identifying the most accurate method for specific scenarios; and 2) choose the best computationally efficient docking algorithm for each docking case, aiming to reduce the time required for docking while maintaining high accuracy. We validate our approach on PDBBind 2020 refined set, which cont...
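The two objectives above (accuracy first, then computational cost) can be sketched as a simple selection rule. This is only an illustration under stated assumptions, not the GNNAS-Dock method: it takes per-algorithm predictions as given (in the paper these would come from the GNN), and the class name, field names, 2.0 Angstrom tolerance, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Prediction:
    """Hypothetical per-algorithm prediction for one docking case."""
    algorithm: str
    predicted_rmsd: float      # predicted pose error, in Angstroms
    expected_runtime_s: float  # expected wall-clock cost, in seconds


def select_algorithm(
    predictions: Sequence[Prediction],
    rmsd_tolerance: float = 2.0,
) -> Prediction:
    """Pick the cheapest algorithm predicted to be accurate enough.

    If no algorithm is predicted to reach the tolerance, fall back to
    the one with the lowest predicted RMSD regardless of cost.
    """
    if not predictions:
        raise ValueError("need at least one candidate algorithm")
    accurate = [p for p in predictions if p.predicted_rmsd <= rmsd_tolerance]
    if accurate:
        return min(accurate, key=lambda p: p.expected_runtime_s)
    return min(predictions, key=lambda p: p.predicted_rmsd)


# Invented numbers for three placeholder algorithms.
candidates = [
    Prediction("algo_a", predicted_rmsd=1.1, expected_runtime_s=300.0),
    Prediction("algo_b", predicted_rmsd=1.7, expected_runtime_s=40.0),
    Prediction("algo_c", predicted_rmsd=3.9, expected_runtime_s=5.0),
]
chosen = select_algorithm(candidates)
```

Here `algo_b` is chosen: it is predicted to stay within the tolerance and is cheaper than `algo_a`, while `algo_c` is faster but predicted to be inaccurate.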
Graph Neural Networks for Quantifying Compatibility Mechanisms in Traditional Chinese Medicine.
EN: Traditional Chinese Medicine (TCM) involves complex compatibility mechanisms characterized by multi-component and multi-target interactions, which are challenging to quantify. To address this challenge, we applied graph artificial intelligence to develop a TCM multi-dimensional knowledge graph that bridges traditional TCM theory and modern biomedical science (https://zenodo.org/records/13763953). Using feature engineering and embedding, we processed key TCM terminology and Chinese herbal pieces (CHP), introducing medicinal properties as virtual nodes and employing graph neural networks with attention mechanisms to model and analyze 6,080 Chinese herbal formulas (CHF). Our method quantitatively assessed the roles of CHP within CHF and was validated using 215 CHF designed for COVID-19 management. With interpretable models, open-source data, and code (https://github.com/ZENGJingqi/GraphAI-for-TCM), this study provides robust tools for advancing TCM theory and drug discovery.
GeomCLIP: Contrastive Geometry-Text Pre-training for Molecules.
EN: Pretraining molecular representations is crucial for drug and material discovery. Recent methods focus on learning representations from geometric structures, effectively capturing 3D position information. Yet, they overlook the rich information in biomedical texts, which detail molecules' properties and substructures. With this in mind, we set up a data collection effort for 200K pairs of ground-state geometric structures and biomedical texts, resulting in a PubChem3D dataset. Based on this dataset, we propose the GeomCLIP framework to enhance multi-modal representation learning from molecular structures and biomedical text. During pre-training, we design two types of tasks, i.e., multimodal representation alignment and unimodal denoising pretraining, to align the 3D geometric encoder with textual information and, at the same time, preserve its original representation power. Experimental results show the effectiveness of GeomCLIP in various tasks such as molecular property prediction, zero-shot text-molecule retrieval, and 3D molecule captioning. Our code and collected dataset are available at https://github.com/xiaocui3737/GeomCLIP
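As a rough illustration of the multimodal alignment task mentioned above, the sketch below computes a generic CLIP-style symmetric contrastive loss between paired geometry and text embeddings. The abstract does not state the exact objective, so treating alignment as symmetric InfoNCE is an assumption, as are the function names and the temperature value of 0.07; embeddings are plain lists of floats and must be non-zero vectors.

```python
import math
from typing import List, Sequence


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean of -log softmax(row_i)[i]: row i should match column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(
    geom_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """Symmetric contrastive loss; pair i of each list belongs together."""
    geom = [_normalize(v) for v in geom_embs]
    text = [_normalize(v) for v in text_embs]
    # Cosine similarity of every geometry/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(g, t)) / temperature for t in text]
        for g in geom
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        _diagonal_cross_entropy(logits)
        + _diagonal_cross_entropy(logits_transposed)
    )


# Toy check: correctly paired embeddings give a lower loss than shuffled ones.
aligned_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each structure embedding toward its own text description and away from the descriptions of other molecules in the batch.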
Causal Representation Learning from Multimodal Biomedical Observations.
EN: Prevalent in biomedical applications (e.g., human phenotype research), multimodal datasets can provide valuable insights into the underlying physiological mechanisms. However, current machine learning (ML) models designed to analyze these datasets often lack interpretability and identifiability guarantees, which are essential for biomedical research. Recent advances in causal representation learning have shown promise in identifying interpretable latent causal variables with formal theoretical guarantees. Unfortunately, most current work on multimodal distributions either relies on restrictive parametric assumptions or yields only coarse identification results, limiting its applicability to biomedical research that favors a detailed understanding of the mechanisms. In this work, we aim to develop flexible identification conditions for multimodal data and principled methods to facilitate the understanding of biomedical datasets. Theoretically, we consider a nonparametric latent distribution (cf. parametric assumptions in previous work) that allows for causal relationships across potentially different modalities. We establish identifiability guarantees for each latent component...
Anticipatory Understanding of Resilient Agriculture to Climate.
EN: With billions of people facing moderate or severe food insecurity, the resilience of the global food supply will be of increasing concern due to the effects of climate change and geopolitical events. In this paper we describe a framework to better identify food security hotspots using a combination of remote sensing, deep learning, crop yield modeling, and causal modeling of the food distribution system. While we feel that the methods are adaptable to other regions of the world, we focus our analysis on the wheat breadbasket of northern India, which supplies a large percentage of the world's population. We present a quantitative analysis of deep learning domain adaptation methods for wheat farm identification based on curated remote sensing data from France. We model climate change impacts on crop yields using the existing crop yield modeling tool WOFOST and we identify key drivers of crop simulation error using a longitudinal penalized functional regression. A description of a system dynamics model of the food distribution system in India is also presented, along with results of food insecurity identification based on seeding this model with the predicted crop yields.
Integrating Large Language Models for Genetic Variant Classification.
EN: The classification of genetic variants, particularly Variants of Uncertain Significance (VUS), poses a significant challenge in clinical genetics and precision medicine. Large Language Models (LLMs) have emerged as transformative tools in this realm. These models can uncover intricate patterns and predictive insights that traditional methods might miss, thus enhancing the predictive accuracy of genetic variant pathogenicity. This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense, which leverage DNA and protein sequence data alongside structural insights to form a comprehensive analytical framework for variant classification. Our approach evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets, setting new benchmarks in classification performance. The models were rigorously tested on a set of challenging variants, demonstrating substantial improvements over existing state-of-the-art tools, especially in handling ambiguous and clinically uncertain variants. The results of this research underline the efficacy of combining multiple modeling approaches to significantly refine the accuracy and r...
Bayesian algorithmic perfumery: A Hierarchical Relevance Vector Machine for the Estimation of Personalized Fragrance Preferences based on Three Sensory Layers and Jungian Personality Archetypes.
EN: This study explores a Bayesian algorithmic approach to personalized fragrance recommendation by integrating hierarchical Relevance Vector Machines (RVM) and Jungian personality archetypes. The paper proposes a structured model that links individual scent preferences for top, middle, and base notes to personality traits derived from Jungian archetypes, such as the Hero, Caregiver, and Explorer, among others. The algorithm utilizes Bayesian updating to dynamically refine predictions as users interact with each fragrance note. This iterative process allows for the personalization of fragrance experiences based on prior data and personality assessments, leading to adaptive and interpretable recommendations. By combining psychological theory with Bayesian machine learning, this approach addresses the complexity of modeling individual preferences while capturing user-specific and population-level trends. The study highlights the potential of hierarchical Bayesian frameworks in creating customized olfactory experiences, informed by psychological and demographic factors, contributing to advancements in personalized product design and machine learning applications in sensory-based industrie...
Automatic solid form classification in pharmaceutical drug development.
EN: In materials and pharmaceutical development, rapidly and accurately determining the similarity between X-ray powder diffraction (XRPD) measurements is crucial for efficient solid form screening and analysis. We present SMolNet, a classifier based on a Siamese network architecture, designed to automate the comparison of XRPD patterns. Our results show that training SMolNet on loss functions from the self-supervised learning domain yields a substantial boost in performance with respect to class separability and precision, specifically when classifying phases of previously unseen compounds. The application of SMolNet demonstrates significant improvements in screening efficiency across multiple active pharmaceutical ingredients, providing a powerful tool for scientists to discover and categorize measurements with reliable accuracy.
Identifiability analysis of vaccination decision-making dynamics.
EN: Variations in individuals' perceptions of vaccination and decision-making processes can give rise to poor vaccination coverage. Future vaccination promotion programs will benefit from understanding this heterogeneity amongst groups within a population and, accordingly, tailoring the communication strategies. Motivated by this, we developed a mechanistic model consisting of a system of ordinary differential equations that categorizes individuals based on two factors: (i) perceived payoff gains for vaccination and (ii) decision-making strategies where we assumed that individuals may behave as either myopic rationalists, going for a dose of vaccine if doing so maximizes their perceived payoff gain, or success-based learners, waiting to observe feedback on vaccination before deciding. We then investigated the global identifiability of group proportions and perceived payoff gains, that is, the possibility of globally retrieving these parameters by observing the error-free cumulative proportion of vaccinated individuals over time. To do so, for each group, we assumed a piecewise constant payoff gain and, for each time interval, obtained the so-called generalized input-output eq...
MassSpecGym: A benchmark for the discovery and identification of molecules.
EN: The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols. To address this problem, we propose MassSpecGym -- the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics a...
TurboHopp: Accelerated Molecule Scaffold Hopping with Consistency Models.
EN: Navigating the vast chemical space of druggable compounds is a formidable challenge in drug discovery, where generative models are increasingly employed to identify viable candidates. Conditional 3D structure-based drug design (3D-SBDD) models, which take into account complex three-dimensional interactions and molecular geometries, are particularly promising. Scaffold hopping is an efficient strategy that facilitates the identification of similar active compounds by strategically modifying the core structure of molecules, effectively narrowing the wide chemical space and enhancing the discovery of drug-like products. However, the practical application of 3D-SBDD generative models is hampered by their slow processing speeds. To address this bottleneck, we introduce TurboHopp, an accelerated pocket-conditioned 3D scaffold hopping model that merges the strategic effectiveness of traditional scaffold hopping with rapid generation capabilities of consistency models. This synergy not only enhances efficiency but also significantly boosts generation speeds, achieving up to 30 times faster inference speed as well as superior generation quality compared to existing diffusion-based models, e...
PatentAgent: Intelligent Agent for Automated Pharmaceutical Patent Analysis.
EN: Pharmaceutical patents play a vital role in biochemical industries, especially in drug discovery, providing researchers with unique early access to data, experimental results, and research insights. With the advancement of machine learning, patent analysis has evolved from manual labor to tasks assisted by automatic tools. However, there is still no unified agent that assists every aspect of patent analysis, from patent reading to core chemical identification. Leveraging the capabilities of Large Language Models (LLMs) to understand requests and follow instructions, we introduce the first intelligent agent in this domain, PatentAgent, poised to advance and potentially revolutionize the landscape of pharmaceutical research. PatentAgent comprises three key end-to-end modules -- PA-QA, PA-Img2Mol, and PA-CoreId -- that respectively perform (1) patent question-answering, (2) image-to-molecular-structure conversion, and (3) core chemical structure identification, addressing the essential needs of scientists and practitioners in pharmaceutical patent analysis. Each module of PatentAgent demonstrates significa...
Binding memory of liquid molecules.
EN: Understanding the binding dynamics of liquid molecules is of fundamental importance in physical and life sciences. However, nanoscale fast dynamics pose great challenges for experimental characterization. Conventionally, the binding dynamics have been assumed to be memoryless. Here, we integrate large scale computer simulation, scaling theory, and real-time single particle tracking microscopy with high spatiotemporal precision to unveil a universal memory effect in the binding dynamics of liquid molecules. This binding memory can be quantified by a binding time autocorrelation function, whose power-law decay depends not only on the binding affinity, but also on the topological and materials properties of the surrounding environment. Context-dependent biomolecular binding memory is likely exploited by biological systems to regulate biochemical reactions and biophysical processes. Deciphering this binding memory offers a novel strategy to probe complex biological systems and advanced soft materials.
An Open Quantum Chemistry Property Database of 120 Kilo Molecules with 20 Million Conformers.
EN: Artificial intelligence is revolutionizing computational chemistry, bringing unprecedented innovation and efficiency to the field. To further advance research and expedite progress, we introduce the Quantum Open Organic Molecular (QO2Mol) database -- a large-scale quantum chemistry dataset designed for professional and transformative research in organic molecular sciences under an open-source license. The database comprises 120,000 organic molecules and approximately 20 million conformers, encompassing 10 different elements (C, H, O, N, S, P, F, Cl, Br, I), with heavy atom counts exceeding 40. Utilizing the high-precision B3LYP/def2-SVP quantum mechanical level, each conformation was meticulously computed for quantum mechanical properties, including potential energy and forces. These molecules are derived from fragments of compounds in ChEMBL, ensuring their structural relevance to real-world compounds. Its extensive coverage of molecular structures and diverse elemental composition enables comprehensive studies of structure-property relationships, enhancing the accuracy and applicability of machine learning models in predicting molecular behaviors. The QO2Mol database and benchmar...
Disease Outbreak Detection and Forecasting: A Review of Methods and Data Sources.
EN: Infectious diseases occur when pathogens from other individuals or animals infect a person, resulting in harm to both individuals and society as a whole. The outbreak of such diseases can pose a significant threat to human health. However, early detection and tracking of these outbreaks have the potential to reduce the mortality impact. To address these threats, public health authorities have endeavored to establish comprehensive mechanisms for collecting disease data. Many countries have implemented infectious disease surveillance systems, with the detection of epidemics being a primary objective. The clinical healthcare system, local/state health agencies, federal agencies, academic/professional groups, and collaborating governmental entities all play pivotal roles within this system. Moreover, nowadays, search engines and social media platforms can serve as valuable tools for monitoring disease trends. The Internet and social media have become significant platforms where users share information about their preferences and relationships. This real-time information can be harnessed to gauge the influence of ideas and societal opinions, making it highly useful across various domain...
Enhancing AI Accessibility in Veterinary Medicine: Linking Classifiers and Electronic Health Records.
EN: In the rapidly evolving landscape of veterinary healthcare, integrating machine learning (ML) clinical decision-making tools with electronic health records (EHRs) promises to improve diagnostic accuracy and patient care. However, the seamless integration of ML classifiers into existing EHRs in veterinary medicine is frequently hindered by the rigidity of EHR systems or the limited availability of IT resources. To address this shortcoming, we present Anna, a freely available software solution that provides ML classifier results for EHR laboratory data in real time.
DeltaDock: A Unified Framework for Accurate, Efficient, and Physically Reliable Molecular Docking.
EN: Molecular docking, a technique for predicting ligand binding poses, is crucial in structure-based drug design for understanding protein-ligand interactions. Recent advancements in docking methods, particularly those leveraging geometric deep learning (GDL), have demonstrated significant efficiency and accuracy advantages over traditional sampling methods. Despite these advancements, current methods are often tailored for specific docking settings, and limitations such as the neglect of protein side-chain structures, difficulties in handling large binding pockets, and challenges in predicting physically valid structures exist. To accommodate various docking settings and achieve accurate, efficient, and physically reliable docking, we propose a novel two-stage docking framework, DeltaDock, consisting of pocket prediction and site-specific docking. We innovatively reframe the pocket prediction task as a pocket-ligand alignment problem rather than direct prediction in the first stage. Then we follow a bi-level coarse-to-fine iterative refinement process to perform site-specific docking. Comprehensive experiments demonstrate the superior performance of DeltaDock. Notably, in the blind d...
Unified Representation of Genomic and Biomedical Concepts through Multi-Task, Multi-Source Contrastive Learning.
EN: We introduce GENomic Encoding REpresentation with Language Model (GENEREL), a framework designed to bridge genetic and biomedical knowledge bases. What sets GENEREL apart is its ability to fine-tune language models to infuse biological knowledge behind clinical concepts such as diseases and medications. This fine-tuning enables the model to capture complex biomedical relationships more effectively, enriching the understanding of how genomic data connects to clinical outcomes. By constructing a unified embedding space for biomedical concepts and a wide range of common SNPs from sources such as patient-level data, biomedical knowledge graphs, and GWAS summaries, GENEREL aligns the embeddings of SNPs and clinical concepts through multi-task contrastive learning. This allows the model to adapt to diverse natural language representations of biomedical concepts while bypassing the limitations of traditional code mapping systems across different data sources. Our experiments demonstrate GENEREL's ability to effectively capture the nuanced relationships between SNPs and clinical concepts. GENEREL also emerges to discern the degree of relatedness, potentially allowing for a more refined ide...
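To illustrate the kind of contrastive alignment the abstract describes, here is a minimal sketch in plain Python. It assumes an InfoNCE-style objective over paired SNP and clinical-concept embeddings; the abstract does not state which loss GENEREL uses, so the function names, the temperature value, and the toy vectors are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(snp_embs, concept_embs, temperature=0.1):
    """Mean InfoNCE loss, where snp_embs[i] is paired with concept_embs[i].

    Assumption: this is a generic contrastive loss, not GENEREL's published one.
    """
    losses = []
    for i, snp in enumerate(snp_embs):
        logits = [cosine(snp, c) / temperature for c in concept_embs]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-D embeddings (hypothetical): correctly paired vs. deliberately swapped.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each SNP embedding is closest to its own concept and large when the pairs are mismatched, which is the property a training loop would exploit.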
Survey of Deep Learning and Physics-Based Approaches in Computational Wave Imaging.
EN: Computational wave imaging (CWI) extracts hidden structure and physical properties of a volume of material by analyzing wave signals that traverse that volume. Applications include seismic exploration of the Earth's subsurface, acoustic imaging and non-destructive testing in material science, and ultrasound computed tomography in medicine. Current approaches for solving CWI problems can be divided into two categories: those rooted in traditional physics, and those based on deep learning. Physics-based methods stand out for their ability to provide high-resolution and quantitatively accurate estimates of acoustic properties within the medium. However, they can be computationally intensive and are susceptible to ill-posedness and nonconvexity typical of CWI problems. Machine learning-based computational methods have recently emerged, offering a different perspective to address these challenges. Diverse scientific communities have independently pursued the integration of deep learning in CWI. This review discusses how contemporary scientific machine-learning (ML) techniques, and deep neural networks in particular, have been developed to enhance and integrate with traditional physics-b...
Chemistry-Inspired Diffusion with Non-Differentiable Guidance.
EN: Recent advances in diffusion models have shown remarkable potential in the conditional generation of novel molecules. These models can be guided in two ways: (i) explicitly, through additional features representing the condition, or (ii) implicitly, using a property predictor. However, training property predictors or conditional diffusion models requires an abundance of labeled data and is inherently challenging in real-world applications. We propose a novel approach that attenuates the limitations of acquiring large labeled datasets by leveraging domain knowledge from quantum chemistry as a non-differentiable oracle to guide an unconditional diffusion model. Instead of relying on neural networks, the oracle provides accurate guidance in the form of estimated gradients, allowing the diffusion process to sample from a conditional distribution specified by quantum chemistry. We show that this results in more precise conditional generation of novel and stable molecular structures. Our experiments demonstrate that our method: (1) significantly reduces atomic forces, enhancing the validity of generated molecules when used for stability optimization; (2) is compatible with both explicit ...
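The idea of guiding generation with estimated gradients from a non-differentiable oracle can be sketched as follows. This is a minimal illustration, assuming central finite differences as the gradient estimator and a toy quadratic energy standing in for a quantum chemistry calculation; the abstract does not specify the estimator the authors use, and the denoising step of the diffusion model is omitted entirely.

```python
def estimate_gradient(oracle, x, eps=1e-4):
    """Central finite-difference gradient of a black-box scalar oracle."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grad.append((oracle(up) - oracle(down)) / (2 * eps))
    return grad


def toy_energy(x):
    """Hypothetical stand-in for a quantum chemistry call; minimum at (1, -2)."""
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


def guided_step(x, oracle, step_size=0.1):
    """Move the sample against the estimated gradient (the guidance term only)."""
    grad = estimate_gradient(oracle, x)
    return [xi - step_size * gi for xi, gi in zip(x, grad)]


x = [0.0, 0.0]
for _ in range(50):
    x = guided_step(x, toy_energy)
```

In a real sampler the guidance term would be added to each denoising update rather than applied on its own; here it is isolated to show that only oracle evaluations, not oracle derivatives, are required.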
Systematic Literature Review of Vision-Based Approaches to Outdoor Livestock Monitoring with Lessons from Wildlife Studies.
EN: Precision livestock farming (PLF) aims to improve the health and welfare of livestock animals and farming outcomes through the use of advanced technologies. Computer vision, combined with recent advances in machine learning and deep learning artificial intelligence approaches, offers a possible solution to the PLF ideal of 24/7 livestock monitoring that helps facilitate early detection of animal health and welfare issues. However, a significant number of livestock species are raised in large outdoor habitats that pose technological challenges for computer vision approaches. This review provides a comprehensive overview of computer vision methods and open challenges in outdoor animal monitoring. We include research from both the livestock and wildlife fields in the review because of the similarities in appearance, behaviour, and habitat for many livestock and wildlife. We focus on large terrestrial mammals, such as cattle, horses, deer, goats, sheep, koalas, giraffes, and elephants. We use an image processing pipeline to frame our discussion and highlight the current capabilities and open technical challenges at each stage of the pipeline. The review found a clear trend towards the ...
Text-guided Diffusion Model for 3D Molecule Generation.
EN: The de novo generation of molecules with targeted properties is crucial in biology, chemistry, and drug discovery. Current generative models are limited to using single property values as conditions and struggle with complex customizations described in detailed human language. To address this, we propose text guidance instead and introduce TextSMOG, a new Text-guided Small Molecule Generation approach via a 3D diffusion model, which integrates language and diffusion models for text-guided small molecule generation. This method uses textual conditions to guide molecule generation, enhancing both stability and diversity. Experimental results show TextSMOG's proficiency in capturing and utilizing information from textual descriptions, making it a powerful tool for generating 3D molecular structures in response to complex textual customizations.
Immunogenicity Prediction with Dual Attention Enables Vaccine Target Selection.
EN: Immunogenicity prediction is a central topic in reverse vaccinology for finding candidate vaccines that can trigger protective immune responses. Existing approaches typically rely on highly compressed features and simple model architectures, leading to limited prediction accuracy and poor generalizability. To address these challenges, we introduce VenusVaccine, a novel deep learning solution with a dual attention mechanism that integrates pre-trained latent vector representations of protein sequences and structures. We also compile the most comprehensive immunogenicity dataset to date, encompassing over 7000 antigen sequences, structures, and immunogenicity labels from bacteria, viruses, and tumors. Extensive experiments demonstrate that VenusVaccine outperforms existing methods across a wide range of evaluation metrics. Furthermore, we establish a post-hoc validation protocol to assess the practical significance of deep learning models in tackling vaccine design challenges. Our work provides an effective tool for vaccine design and sets valuable benchmarks for future research. The implementation is at https://github.com/songleee/VenusVaccine.
Explainable Diagnosis Prediction through Neuro-Symbolic Integration.
EN: Diagnosis prediction is a critical task in healthcare, where timely and accurate identification of medical conditions can significantly impact patient outcomes. Traditional machine learning and deep learning models have achieved notable success in this domain but often lack interpretability, which is a crucial requirement in clinical settings. In this study, we explore the use of neuro-symbolic methods, specifically Logical Neural Networks (LNNs), to develop explainable models for diagnosis prediction. Specifically, we design and implement LNN-based models that integrate domain-specific knowledge through logical rules with learnable thresholds. Our models, particularly $M_{\text{multi-pathway}}$ and $M_{\text{comprehensive}}$, demonstrate superior performance over traditional models such as Logistic Regression, SVM, and Random Forest, achieving higher accuracy (up to 80.52%) and AUROC scores (up to 0.8457) in the case study of diabetes prediction. The learned weights and thresholds within the LNN models provide direct insights into feature contributions, enhancing interpretability without compromising predictive power. These findings highlight the potential of neuro-symbolic approac...
Learning Personalized Treatment Decisions in Precision Medicine: Disentangling Treatment Assignment Bias in Counterfactual Outcome Prediction and Biomarker Identification.
EN: Precision medicine has the potential to tailor treatment decisions to individual patients using machine learning (ML) and artificial intelligence (AI), but it faces significant challenges due to complex biases in clinical observational data and the high-dimensional nature of biological data. This study models various types of treatment assignment biases using mutual information and investigates their impact on ML models for counterfactual prediction and biomarker identification. Unlike traditional counterfactual benchmarks that rely on fixed treatment policies, our work focuses on modeling different characteristics of the underlying observational treatment policy in distinct clinical settings. We validate our approach through experiments on toy datasets, semi-synthetic data from The Cancer Genome Atlas (TCGA), and real-world biological outcomes from drug and CRISPR screens. By incorporating empirical biological mechanisms, we create a more realistic benchmark that reflects the complexities of real-world data. Our analysis reveals that different biases lead to varying model performances, with some biases, especially those unrelated to outcome mechanisms, having minimal effect on predic...
Assessing interaction recovery of predicted protein-ligand poses.
EN: The field of protein-ligand pose prediction has seen significant advances in recent years, with machine learning-based methods now being commonly used in lieu of classical docking methods or even to predict all-atom protein-ligand complex structures. Most contemporary studies focus on the accuracy and physical plausibility of ligand placement to determine pose quality, often neglecting a direct assessment of the interactions observed with the protein. In this work, we demonstrate that ignoring protein-ligand interaction fingerprints can lead to overestimation of model performance, most notably in recent protein-ligand cofolding models which often fail to recapitulate key interactions.
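A direct assessment of interactions can be as simple as comparing interaction fingerprints between the reference and predicted poses. The sketch below assumes a fingerprint is a set of (residue, interaction type) pairs and defines recovery as the fraction of reference interactions reproduced by the prediction; the residue labels are hypothetical, and the abstract does not state the exact metric the authors use.

```python
def interaction_recovery(reference, predicted):
    """Fraction of reference interactions that also appear in the predicted pose."""
    if not reference:
        raise ValueError("reference fingerprint must contain at least one interaction")
    return len(reference & predicted) / len(reference)


# Hypothetical fingerprints: (residue, interaction type) pairs.
reference = {
    ("ASP86", "hbond"),
    ("PHE80", "pi_stacking"),
    ("LYS33", "ionic"),
    ("LEU134", "hydrophobic"),
}
predicted = {
    ("ASP86", "hbond"),
    ("LEU134", "hydrophobic"),
    ("GLU81", "hbond"),  # an interaction absent from the reference pose
}

recovery = interaction_recovery(reference, predicted)
```

A pose can sit close to the reference geometrically and still score poorly here, which is the overestimation risk the abstract points to.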
FlexSBDD: Structure-Based Drug Design with Flexible Protein Modeling.
EN: Structure-based drug design (SBDD), which aims to generate 3D ligand molecules binding to target proteins, is a fundamental task in drug discovery. Existing SBDD methods typically treat the protein as rigid and neglect protein structural changes upon ligand binding, leading to a large gap with real-world scenarios and inferior generation quality (e.g., many steric clashes). To bridge the gap, we propose FlexSBDD, a deep generative model capable of accurately modeling the flexible protein-ligand complex structure for ligand molecule generation. FlexSBDD adopts an efficient flow matching framework and leverages an E(3)-equivariant network with scalar-vector dual representation to model dynamic structural changes. Moreover, novel data augmentation schemes based on structure relaxation/sidechain repacking are adopted to boost performance. Extensive experiments demonstrate that FlexSBDD achieves state-of-the-art performance in generating high-affinity molecules and effectively modeling the protein's conformation change to increase favorable protein-ligand interactions (e.g., hydrogen bonds) and decrease steric clashes.
Understanding Clinical Decision-Making in Traditional East Asian Medicine through Dimensionality Reduction: An Empirical Investigation.
EN: This study examines the clinical decision-making processes in Traditional East Asian Medicine (TEAM) by reinterpreting pattern identification (PI) through the lens of dimensionality reduction. Focusing on the Eight Principle Pattern Identification (EPPI) system and utilizing empirical data from the Shang-Han-Lun, we explore the necessity and significance of prioritizing the Exterior-Interior pattern in diagnosis and treatment selection. We test three hypotheses: whether the Ext-Int pattern contains the most information about patient symptoms, represents the most abstract and generalizable symptom information, and facilitates the selection of appropriate herbal prescriptions. Employing quantitative measures such as the abstraction index, cross-conditional generalization performance, and decision tree regression, our results demonstrate that the Exterior-Interior pattern represents the most abstract and generalizable symptom information, contributing to the efficient mapping between symptom and herbal prescription spaces. This research provides an objective framework for understanding the cognitive processes underlying TEAM, bridging traditional medical practices with modern computat...
Quantum Machine Learning in Drug Discovery: Applications in Academia and Pharmaceutical Industries.
EN: The nexus of quantum computing and machine learning - quantum machine learning - offers the potential for significant advancements in chemistry. This review specifically explores the potential of quantum neural networks on gate-based quantum computers within the context of drug discovery. We discuss the theoretical foundations of quantum machine learning, including data encoding, variational quantum circuits, and hybrid quantum-classical approaches. Applications to drug discovery are highlighted, including molecular property prediction and molecular generation. We provide a balanced perspective, emphasizing both the potential benefits and the challenges that must be addressed.
Fully automatic extraction of morphological traits from the Web: utopia or reality?
EN: Plant morphological traits, their observable characteristics, are fundamental to understanding the role played by each species within its ecosystem. However, compiling trait information for even a moderate number of species is a demanding task that may take experts years to accomplish. At the same time, massive amounts of information about species descriptions are available online in the form of text, although the lack of structure makes this source of data impossible to use at scale. To overcome this, we propose to leverage recent advances in large language models (LLMs) and devise a mechanism for gathering and processing information on plant traits in the form of unstructured textual descriptions, without manual curation. We evaluate our approach by automatically replicating three manually created species-trait matrices. Our method managed to find values for over half of all species-trait pairs, with an F1-score of over 75%. Our results suggest that large-scale creation of structured trait databases from unstructured online text is currently feasible thanks to the information extraction capabilities of LLMs, being limited by the availability of textual descriptions covering all the...
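The two reported figures, coverage of species-trait pairs and F1-score, can be computed as in the sketch below. It assumes each (species, trait) pair holds a single value and that precision is measured over the pairs the method filled in; the abstract does not describe the authors' exact scoring, and the species and trait values are made up for illustration.

```python
def score_trait_matrix(gold, predicted):
    """Compare predicted trait values against a manually curated matrix.

    Both arguments map (species, trait) pairs to a single trait value.
    Returns (coverage, precision, recall, f1).
    """
    filled = [pair for pair in gold if pair in predicted]
    correct = sum(1 for pair in filled if predicted[pair] == gold[pair])
    coverage = len(filled) / len(gold)
    precision = correct / len(filled) if filled else 0.0
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return coverage, precision, recall, f1


# Hypothetical curated matrix and extracted values.
gold = {
    ("Quercus robur", "leaf_shape"): "lobed",
    ("Quercus robur", "fruit_type"): "acorn",
    ("Acer rubrum", "leaf_shape"): "palmate",
    ("Acer rubrum", "fruit_type"): "samara",
}
predicted = {
    ("Quercus robur", "leaf_shape"): "lobed",
    ("Quercus robur", "fruit_type"): "nut",  # wrong value
    ("Acer rubrum", "leaf_shape"): "palmate",
}

coverage, precision, recall, f1 = score_trait_matrix(gold, predicted)
```

Here three of four pairs are filled and two are correct, so coverage is 3/4, precision 2/3, recall 1/2, and F1 works out to 4/7.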
Analysis of Gene Regulatory Networks from Gene Expression Using Graph Neural Networks.
EN: Unraveling the complexities of Gene Regulatory Networks (GRNs) is crucial for understanding cellular processes and disease mechanisms. Traditional computational methods often struggle with the dynamic nature of these networks. This study explores the use of Graph Neural Networks (GNNs), a powerful approach for modeling graph-structured data like GRNs. Utilizing a Graph Attention Network v2 (GATv2), our study presents a novel approach to the construction and interrogation of GRNs, informed by gene expression data and Boolean models derived from literature. The model's adeptness in accurately predicting regulatory interactions and pinpointing key regulators is attributed to advanced attention mechanisms, a hallmark of the GNN framework. These insights suggest that GNNs are primed to revolutionize GRN analysis, addressing traditional limitations and offering richer biological insights. The success of GNNs, as highlighted by our model's reliance on high-quality data, calls for enhanced data collection methods to sustain progress. The integration of GNNs in GRN research is set to pioneer developments in personalized medicine, drug discovery, and our grasp of biological systems, bolstere...
The Future of Decoding Non-Standard Nucleotides: Leveraging Nanopore Sequencing for Expanded Genetic Codes.
EN: Expanding genetic codes from natural standard nucleotides to artificial non-standard nucleotides marks a significant advancement in synthetic biology, with profound implications for biotechnology and medicine. Decoding the biological information encoded in these non-standard nucleotides presents new challenges, as traditional sequencing technologies are unable to recognize or interpret novel base pairings. In this perspective, we explore the potential of nanopore sequencing, which is uniquely suited to decipher both standard and non-standard nucleotides by directly measuring the biophysical properties of nucleic acids. Nanopore technology offers real-time, long-read sequencing without the need for amplification or synthesis, making it particularly advantageous for expanded genetic systems like Artificially Expanded Genetic Information Systems (AEGIS). We discuss how the adaptability of nanopore sequencing and advancements in data processing can unlock the potential of these synthetic genomes and open new frontiers in understanding and utilizing expanded genetic codes.
Effective management of white rust disease in red amaranth: a field study in Dhaka, Bangladesh.
EN: This study aimed to evaluate the effective management strategies of Albugo candida, a pathogen of white rust disease in red amaranth (Amaranthus tricolor L.), accountable for the reduction of seed production. The study was performed during the Rabi season of 2018 and the Kharif season of 2019 at Sher-e-Bangla Agricultural University in Bangladesh. Eight treatments, including chemical, botanical, and biopesticide treatments such as Ridomil Gold 68 WG, Autostin 50 WP, Dithane M 45, Goldton 50 WP, the Bordeaux mixture, G-Derma, Garlic bulb extract, and Allamanda leaf extract, were evaluated. Four foliar sprays were applied at seven-day intervals after disease symptom onset. The field experiments followed a randomized complete block design with three replications. A microscopic study confirmed that Albugo candida was the causal organism. In both seasons, Ridomil Gold demonstrated superior efficacy in reducing disease incidence in plants, disease incidence in leaves, and disease severity, which were 63.07%, 62.78.5, and 84.31%, respectively, in Rabi and 69.73%, 65.71%, and 88.41%, respectively, in the Kharif season. Allamanda leaf extract also had statistically similar results, while Au...
The microbiome science of composting and human excrement composting: a review.
EN: Linear waste management systems are unsustainable and contribute to environmental degradation, economic inequity, and health disparities. Among the array of environmental challenges stemming from anthropogenic impacts, the management of human excrement (human feces and urine) stands as a significant concern. Over two billion people do not have access to adequate sanitation resulting in a global public health crisis. Composting is the microbial biotechnology aimed at cycling organic waste, including human excrement, for improved public health, agricultural productivity and safety, and environmental sustainability. Applications of modern microbiome-omics and related technologies have vast capacity to support continued advances in composting science and praxis. In this article, we review literature focused on applications of microbiome technologies to study composting systems and reactions. The studies we survey generally fall into the categories of animal manure composting, food and landscaping waste composting, biosolids composting, and human excrement composting. We review experiments utilizing microbiome technologies to investigate strategies for enhancing pathogen suppression a...
PatchAlign: Fair and Accurate Skin Disease Image Classification by Alignment with Clinical Labels.
EN: Deep learning models have achieved great success in automating skin lesion diagnosis. However, the ethnic disparity in these models' predictions needs to be addressed before deploying them. We introduce a novel approach, PatchAlign, to enhance skin condition image classification accuracy and fairness by aligning with clinical text representations of skin conditions. PatchAlign uses Graph Optimal Transport (GOT) Loss as a regularizer to perform cross-domain alignment. The representations obtained are robust and generalize well across skin tones, even with limited training samples. To reduce the effect of noise and artifacts in clinical dermatology images, we propose a learnable Masked Graph Optimal Transport for cross-domain alignment that further improves fairness metrics. We compare our model to the state-of-the-art FairDisCo on two skin lesion datasets with different skin types: Fitzpatrick17k and Diverse Dermatology Images (DDI). PatchAlign enhances the accuracy of skin condition image classification by 2.8% (in-domain) and 6.2% (out-domain) on Fitzpatrick17k, and 4.2% (in-domain) on DDI compared to FairDisCo. Additionally, it consistently improves the fairness of true positiv...
Large Language Models-Enabled Digital Twins for Precision Medicine in Rare Gynecological Tumors.
EN: Rare gynecological tumors (RGTs) present major clinical challenges due to their low incidence and heterogeneity. The lack of clear guidelines leads to suboptimal management and poor prognosis. Molecular tumor boards accelerate access to effective therapies by tailoring treatment based on biomarkers, beyond cancer type. Unstructured data that requires manual curation hinders efficient use of biomarker profiling for therapy matching. This study explores the use of large language models (LLMs) to construct digital twins for precision medicine in RGTs. Our proof-of-concept digital twin system integrates clinical and biomarker data from institutional and published cases (n=21) and literature-derived data (n=655 publications with n=404,265 patients) to create tailored treatment plans for metastatic uterine carcinosarcoma, identifying options potentially missed by traditional, single-source analysis. LLM-enabled digital twins efficiently model individual patient trajectories. Shifting to a biology-based rather than organ-based tumor definition enables personalized care that could advance RGT management and thus enhance patient outcomes.
LLaVA-Chef: A Multi-modal Generative Model for Food Recipes.
EN: In the rapidly evolving landscape of online recipe sharing within a globalized context, there has been a notable surge in research towards comprehending and generating food recipes. Recent advancements in large language models (LLMs) like GPT-2 and LLaVA have paved the way for Natural Language Processing (NLP) approaches to delve deeper into various facets of food-related tasks, encompassing ingredient recognition and comprehensive recipe generation. Despite impressive performance and multi-modal adaptability of LLMs, domain-specific training remains paramount for their effective application. This work evaluates existing LLMs for recipe generation and proposes LLaVA-Chef, a novel model trained on a curated dataset of diverse recipe prompts in a multi-stage approach. First, we refine the mapping of visual food image embeddings to the language space. Second, we adapt LLaVA to the food domain by fine-tuning it on relevant recipe data. Third, we utilize diverse prompts to enhance the model's recipe comprehension. Finally, we improve the linguistic quality of generated recipes by penalizing the model with a custom loss function. LLaVA-Chef demonstrates impressive improvements over pretr...
LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction.
EN: Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs' task knowledge and reasoning capabilities, their (parametric) domain knowledge, and the addition of external knowledge. To this end, we evaluate various open LLMs - including BioMistral and Llama-2 models - on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency-based reasoning as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counterintuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the lim...
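As a rough illustration of the Self-Consistency idea named in the abstract (sample several reasoning chains, then keep the most frequent final answer), here is a minimal Python sketch. The function name `self_consistency_vote`, the string normalisation, and the sample answers are assumptions made for illustration; none of them come from the paper.

```python
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Return the most frequent answer and its share of the votes."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    # Assumption: answers that differ only in case or whitespace are the same label.
    normalised = [answer.strip().lower() for answer in sampled_answers]
    label, votes = Counter(normalised).most_common(1)[0]
    return label, votes / len(normalised)


# Hypothetical final answers from five sampled reasoning chains for one item.
samples = ["Positive", "negative", "positive ", "positive", "Negative"]
label, share = self_consistency_vote(samples)
print(label, share)
```

The vote share can serve as a crude confidence signal; a low share suggests the sampled chains disagree.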
One-step Structure Prediction and Screening for Protein-Ligand Complexes using Multi-Task Geometric Deep Learning.
EN: Understanding the structure of the protein-ligand complex is crucial to drug development. Existing virtual structure measurement and screening methods are dominated by docking and its derived methods combined with deep learning. However, the sampling and scoring methodologies have largely restricted their accuracy and efficiency. Here, we show that these two fundamental tasks can be accurately tackled with a single model, namely LigPose, based on multi-task geometric deep learning. By representing the ligand and the protein pair as a graph, LigPose directly optimizes the three-dimensional structure of the complex, with the learning of binding strength and atomic interactions as auxiliary tasks, enabling its one-step prediction ability without docking tools. Extensive experiments show LigPose achieved state-of-the-art performance on major tasks in drug research. Its considerable improvements indicate a promising paradigm for AI-based drug development pipelines.
Novel stabilization mechanisms for concentrated emulsions with tunable morphology via amphiphilic polymer-grafted nanoparticles.
EN: This study explores the stabilization mechanisms of concentrated emulsions with tunable morphology using amphiphilic polymer-grafted nanoparticles (PGNPs). We employ coarse-grained molecular simulations to investigate concentrated oil-in-water emulsions stabilized by partially hydrolyzed poly(vinyl alcohol)-grafted poly(methyl methacrylate) (PMMA) particles. Two grafting architectures were examined: hydrophilic-hydrophobic (AB-type) diblock PGNPs and reverse BA-type diblock PGNPs. Our findings reveal that AB-type diblock PGNPs tend to aggregate, leading to droplet-droplet coalescence. In contrast, BA-type diblock PGNPs disperse effectively in the water phase, stabilizing robust emulsions through a space-filling mechanism. The study further demonstrates that the stability and morphology of the emulsions can be tuned by varying the number of PGNPs. Our results suggest that BA-type diblock PGNPs are more effective in stabilizing concentrated emulsions, offering insights for the design of novel emulsifiers in industrial applications.
Facial Wrinkle Segmentation for Cosmetic Dermatology: Pretraining with Texture Map-Based Weak Supervision.
EN: Facial wrinkle detection plays a crucial role in cosmetic dermatology. Precise manual segmentation of facial wrinkles is challenging and time-consuming, with inherent subjectivity leading to inconsistent results among graders. To address this issue, we propose two solutions. First, we build and release the first public facial wrinkle dataset, 'FFHQ-Wrinkle', an extension of the NVIDIA FFHQ dataset. It includes 1,000 images with human labels and 50,000 images with automatically generated weak labels. This dataset could serve as a foundation for the research community to develop advanced wrinkle detection algorithms. Second, we introduce a simple training strategy utilizing texture maps, applicable to various segmentation models, to detect wrinkles across the face. Our two-stage training strategy first pretrains models on a large dataset with weak labels (N=50k), or masked texture maps generated through computer vision techniques, without human intervention. We then finetune the models using human-labeled data (N=1k), which consists of manually labeled wrinkle masks. The network takes as input a combination of RGB and masked texture map of the image, comprising four channels, in finet...
Fragment and Geometry Aware Tokenization of Molecules for Structure-Based Drug Design Using Language Models.
EN: Structure-based drug design (SBDD) is crucial for developing specific and effective therapeutics against protein targets but remains challenging due to complex protein-ligand interactions and vast chemical space. Although language models (LMs) have excelled in natural language processing, their application in SBDD is underexplored. To bridge this gap, we introduce a method, known as Frag2Seq, to apply LMs to SBDD by generating molecules in a fragment-based manner in which fragments correspond to functional modules. We transform 3D molecules into fragment-informed sequences using SE(3)-equivariant molecule and fragment local frames, extracting SE(3)-invariant sequences that preserve geometric information of 3D fragments. Furthermore, we incorporate protein pocket embeddings obtained from a pre-trained inverse folding model into the LMs via cross-attention to capture protein-ligand interaction, enabling effective target-aware molecule generation. Benefiting from employing LMs with fragment-based generation and effective protein context encoding, our model achieves the best performance on binding vina score and chemical properties such as QED and Lipinski, which shows our model's effi...
Hessian QM9: A quantum chemistry database of molecular Hessians in implicit solvents.
EN: A significant challenge in computational chemistry is developing approximations that accelerate ab initio methods while preserving accuracy. Machine learning interatomic potentials (MLIPs) have emerged as a promising solution for constructing atomistic potentials that can be transferred across different molecular and crystalline systems. Most MLIPs are trained only on energies and forces in vacuum, while an improved description of the potential energy surface could be achieved by including the curvature of the potential energy surface. We present Hessian QM9, the first database of equilibrium configurations and numerical Hessian matrices, consisting of 41,645 molecules from the QM9 dataset at the $ω$B97x/6-31G* level. Molecular Hessians were calculated in vacuum, as well as water, tetrahydrofuran, and toluene using an implicit solvation model. To demonstrate the utility of this dataset, we show that incorporating second derivatives of the potential energy surface into the loss function of an MLIP significantly improves the prediction of vibrational frequencies in all solvent environments, thus making this dataset extremely useful for studying organic molecules in realistic so...
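To make "incorporating second derivatives of the potential energy surface into the loss function" concrete, one common way to write such an objective is sketched below. The weights $w_E$, $w_F$, $w_H$ and the squared-error form are assumptions for illustration, not the loss reported in the paper. Here $E_\theta$ is the model energy, $\mathbf{R}$ the atomic coordinates, and unsubscripted quantities are reference values.

```latex
\mathcal{L}(\theta) = w_E \bigl(E_\theta - E\bigr)^2
  + w_F \bigl\lVert \mathbf{F}_\theta - \mathbf{F} \bigr\rVert^2
  + w_H \bigl\lVert \mathbf{H}_\theta - \mathbf{H} \bigr\rVert_F^2,
\qquad
\mathbf{F}_\theta = -\nabla_{\mathbf{R}} E_\theta, \quad
\mathbf{H}_\theta = \nabla_{\mathbf{R}}^{2} E_\theta
```

The link to vibrational frequencies comes from the harmonic approximation: the eigenvalues $\lambda_k$ of the mass-weighted Hessian give the angular frequencies $\omega_k = \sqrt{\lambda_k}$, so a model that matches $\mathbf{H}$ more closely should also reproduce the harmonic spectrum more closely.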
Drug Discovery SMILES-to-Pharmacokinetics Diffusion Models with Deep Molecular Understanding.
EN: Artificial intelligence (AI) is increasingly used in every stage of drug development. One challenge facing drug discovery AI is that drug pharmacokinetic (PK) datasets are often collected independently from each other, often with limited overlap, creating data overlap sparsity. Data sparsity makes data curation difficult for researchers looking to answer research questions in poly-pharmacy, drug combination research, and high-throughput screening. We propose Imagand, a novel SMILES-to-Pharmacokinetic (S2PK) diffusion model capable of generating an array of PK target properties conditioned on SMILES inputs. We show that Imagand-generated synthetic PK data closely resembles real data univariate and bivariate distributions, and improves performance for downstream tasks. Imagand is a promising solution for data overlap sparsity and allows researchers to efficiently generate ligand PK data for drug discovery research. Code is available at https://github.com/bing1100/Imagand.
Generalized knowledge-enhanced framework for biomedical entity and relation extraction.
EN: In recent years, there has been an increasing number of frameworks developed for biomedical entity and relation extraction. This research effort aims to address the accelerating growth in biomedical publications and the intricate nature of biomedical texts, which are written mainly for domain experts. To handle these challenges, we develop a novel framework that utilizes external knowledge to construct a task-independent and reusable background knowledge graph for biomedical entity and relation extraction. The design of our model is inspired by how humans learn domain-specific topics. In particular, humans often first acquire the most basic and common knowledge regarding a field to build the foundational knowledge and then use that as a basis for extending to various specialized topics. Our framework employs such a common-knowledge-sharing mechanism to build a general neural-network knowledge graph whose learning transfers effectively to different domain-specific biomedical texts. Experimental evaluations demonstrate that our model, equipped with this generalized and cross-transferable knowledge base, achieves competitive performance on benchmarks, including BioRelEx for binding in...
Open-Source Molecular Processing Pipeline for Generating Molecules.
EN: Generative models for molecules have shown considerable promise for use in computational chemistry, but remain difficult to use for non-experts. For this reason, we introduce open-source infrastructure for easily building generative molecular models into the widely used DeepChem [Ramsundar et al., 2019] library with the aim of creating a robust and reusable molecular generation pipeline. In particular, we add high quality PyTorch [Paszke et al., 2019] implementations of the Molecular Generative Adversarial Networks (MolGAN) [Cao and Kipf, 2022] and Normalizing Flows [Papamakarios et al., 2021]. Our implementations show strong performance comparable with past work [Kuznetsov and Polykovskiy, 2021, Cao and Kipf, 2022].
What Ails Generative Structure-based Drug Design: Expressivity is Too Little or Too Much?
EN: Several generative models with elaborate training and sampling procedures have been proposed to accelerate structure-based drug design (SBDD); however, their empirical performance turns out to be suboptimal. We seek to better understand this phenomenon from both theoretical and empirical perspectives. Since most of these models apply graph neural networks (GNNs), one may suspect that they inherit the representational limitations of GNNs. We analyze this aspect, establishing the first such results for protein-ligand complexes. A plausible counterview may attribute the underperformance of these models to their excessive parameterizations, inducing expressivity at the expense of generalization. We investigate this possibility with a simple metric-aware approach that learns an economical surrogate for affinity to infer an unlabelled molecular graph and optimizes for labels conditioned on this graph and molecular properties. The resulting model achieves state-of-the-art results using 100x fewer trainable parameters and affords up to 1000x speedup. Collectively, our findings underscore the need to reassess and redirect the existing paradigm and efforts for SBDD. Code is available at http...
SMILES-Mamba: Chemical Mamba Foundation Models for Drug ADMET Prediction.
EN: In drug discovery, predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of small-molecule drugs is critical for ensuring safety and efficacy. However, the process of accurately predicting these properties is often resource-intensive and requires extensive experimental data. To address this challenge, we propose SMILES-Mamba, a two-stage model that leverages both unlabeled and labeled data through a combination of self-supervised pretraining and fine-tuning strategies. The model first pre-trains on a large corpus of unlabeled SMILES strings to capture the underlying chemical structure and relationships, before being fine-tuned on smaller, labeled datasets specific to ADMET tasks. Our results demonstrate that SMILES-Mamba exhibits competitive performance across 22 ADMET datasets, achieving the highest score in 14 tasks, highlighting the potential of self-supervised learning in improving molecular property prediction. This approach not only enhances prediction accuracy but also reduces the dependence on large, labeled datasets, offering a promising direction for future research in drug discovery.
Hybrid lunar ISRU plant: a comparative analysis with carbothermal reduction and water extraction.
EN: To establish a self-sustained human presence in space and to explore deeper into the solar system, extensive research has been conducted on In-Situ Resource Utilization (ISRU) systems. Past studies have proposed and researched many technologies to produce oxygen from regolith, such as carbothermal reduction and water extraction from icy regolith, to utilize it for astronauts' life support and as the propellant of space systems. However, determining the most promising technology remains challenging due to uncertainties in the lunar environment and processing methods. To better understand the lunar environment and ISRU operations, it is crucial to gather more information. Motivated by this need for information gathering, this paper proposes a new ISRU plant architecture integrating carbothermal reduction of dry regolith and water extraction from icy regolith. Two different hybrid plant architectures integrating both technologies (1) in parallel and (2) in series are examined. The former involves mining and processing in both a Permanently Shadowed Region (PSR) and a peak of eternal light in parallel, while the latter solely mines in a PSR. In this series hybrid architecture, the dry ...
Detection of antifreeze molecule ethylene glycol in the hot molecular core G358.93$-$0.03 MM1.
EN: The identification of complex prebiotic molecules using millimeter and submillimeter telescopes allows us to understand how the basic building blocks of life are formed in the universe. In the interstellar medium (ISM), ethylene glycol ((CH$_{2}$OH)$_{2}$) is the simplest sugar alcohol molecule, and it is the reduced alcohol of the simplest sugar-like molecule, glycolaldehyde (CH$_{2}$OHCHO). We present the first detection of the rotational emission lines of $aGg^{\prime}$ conformer of ethylene glycol ((CH$_{2}$OH)$_{2}$) towards the hot molecular core G358.93$-$0.03 MM1 using the Atacama Large Millimeter/Submillimeter Array (ALMA). The estimated column density of $aGg^{\prime}$-(CH$_{2}$OH)$_{2}$ towards the G358.93$-$0.03 MM1 is (4.5$\pm$0.1)$\times$10$^{16}$ cm$^{-2}$ with an excitation temperature of 155$\pm$35 K. The abundance of $aGg^{\prime}$-(CH$_{2}$OH)$_{2}$ with respect to H$_{2}$ is (1.4$\pm$0.5)$\times$10$^{-8}$. Similarly, the abundances of $aGg^{\prime}$-(CH$_{2}$OH)$_{2}$ with respect to CH$_{2}$OHCHO and CH$_{3}$OH are 3.1$\pm$0.5 and (6.1$\pm$0.3)$\times$10$^{-3}$. We compare the estimated abundance of $aGg^{\prime}$-(CH$_{2}$OH)$_{2}$ with the existing three-phas...
Plant and insect proteins support optimal bone growth and development; Evidences from a pre-clinical model.
EN: By 2050, the global population will exceed 9 billion, demanding a 70% increase in food production. Animal proteins alone may not suffice, and they contribute to global warming. Alternative proteins such as legumes, algae, and insects are being explored, but their health impacts are largely unknown. To investigate this, three-week-old rats were fed diets containing 20% protein from various sources for six weeks. A casein-based control diet was compared to soy isolate, spirulina powder, chickpea isolate, chickpea flour, and fly larvae powder. Except for spirulina, alternative protein groups showed comparable growth patterns to the casein group. Morphological and mechanical tests of femur bones matched growth patterns. Caecal 16S analysis highlighted the impact on gut microbiota diversity. Chickpea flour showed significantly lower $α$-diversity compared with casein and chickpea isolate groups while chickpea flour had the greatest distinction in $β$-diversity. Alternative protein sources supported optimal growth, but quality and health implications require further exploration.
Artificial Intelligence Enhanced Digital Nucleic Acid Amplification Testing for Precision Medicine and Molecular Diagnostics.
EN: The precise quantification of nucleic acids is pivotal in molecular biology, underscored by the rising prominence of nucleic acid amplification tests (NAAT) in diagnosing infectious diseases and conducting genomic studies. This review examines recent advancements in digital Polymerase Chain Reaction (dPCR) and digital Loop-mediated Isothermal Amplification (dLAMP), which surpass the limitations of traditional NAAT by offering absolute quantification and enhanced sensitivity. In this review, we summarize the compelling advancements of dNAAT in addressing pressing public health issues, especially during the COVID-19 pandemic. Further, we explore the transformative role of artificial intelligence (AI) in enhancing dNAAT image analysis, which not only improves efficiency and accuracy but also addresses traditional constraints related to cost, complexity, and data interpretation. In encompassing the state-of-the-art (SOTA) development and potential of both software and hardware, the all-encompassing Point-of-Care Testing (POCT) systems cast new light on benefits including higher throughput, label-free detection, and expanded multiplex analyses. While acknowledging the enhancement of AI-...
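For readers unfamiliar with how digital assays reach "absolute quantification", the standard Poisson correction can be sketched in a few lines of Python. This is textbook dPCR arithmetic, not a method taken from the review; the function names and the example partition counts and volume are hypothetical.

```python
import math


def copies_per_partition(positive, total):
    """Mean target copies per partition, from the fraction of positive partitions."""
    if total <= 0 or not 0 <= positive < total:
        # With zero negative partitions the Poisson estimate diverges.
        raise ValueError("need 0 <= positive < total")
    return -math.log(1.0 - positive / total)


def concentration_per_ul(positive, total, partition_volume_ul):
    """Target concentration in copies per microlitre of partitioned reaction."""
    return copies_per_partition(positive, total) / partition_volume_ul


# Hypothetical run: 5,000 positive partitions out of 20,000, 0.85 nL each.
lam = copies_per_partition(5000, 20000)
conc = concentration_per_ul(5000, 20000, partition_volume_ul=0.00085)
print(lam, conc)
```

The logarithm accounts for partitions that received more than one copy, which is why simply counting positive partitions underestimates the target at high loading.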
Constructing the CORD-19 Vaccine Dataset.
EN: We introduce a new dataset 'CORD-19-Vaccination' to cater to scientists specifically looking into COVID-19 vaccine-related research. This dataset is extracted from the CORD-19 dataset [Wang et al., 2020] and augmented with new columns for language detail, author demography, keywords, and topic per paper. Facebook's fastText model is used to identify languages [Joulin et al., 2016]. To establish author demography (author affiliation, lab/institution location, and lab/institution country columns) we processed the JSON file for each paper and then further enhanced using Google's search API to determine country values. 'Yake' was used to extract keywords from the title, abstract, and body of each paper and the LDA (Latent Dirichlet Allocation) algorithm was used to add topic information [Campos et al., 2020, 2018a,b]. To evaluate the dataset, we demonstrate a question-answering task like the one used in the CORD-19 Kaggle challenge [Goldbloom et al., 2022]. For further evaluation, sequential sentence classification was performed on each paper's abstract using the model from Dernoncourt et al. [2016]. We partially hand-annotated the training dataset and used a pre-trained BERT-PubMed layer. '...
Chemistry-informed Machine Learning Explains Calcium-binding Proteins Fuzzy Shape for Communicating Changes in the Atomic States of Calcium Ions.
EN: Proteins' fuzziness is a feature for communicating changes in cell signaling instigated by binding with secondary messengers, such as calcium ions, associated with the coordination of muscle contraction, neurotransmitter release, and gene expression. Binding with the disordered parts of a protein, calcium ions must balance their charge states with the shape of calcium-binding proteins and their versatile pool of partners depending on the circumstances they transmit, but it is unclear whether the limited experimental data available can be used to train models to accurately predict the charges of calcium-binding protein variants. Here, we developed a chemistry-informed, machine-learning algorithm that implements a game-theoretic approach to explain the output of a machine-learning model without the prerequisite of an excessively large database for high-performance prediction of atomic charges. We used the ab initio electronic structure data representing calcium ions and the structures of the disordered segments of calcium-binding peptides with surrounding water molecules to train several explainable models. Network theory was used to extract the topological features of atomic interac...
Decomposed Direct Preference Optimization for Structure-Based Drug Design.
EN: Diffusion models have achieved promising results for Structure-Based Drug Design (SBDD). Nevertheless, high-quality protein subpocket and ligand data are relatively scarce, which hinders the models' generation capabilities. Recently, Direct Preference Optimization (DPO) has emerged as a pivotal tool for aligning generative models with human preferences. In this paper, we propose DecompDPO, a structure-based optimization method that aligns diffusion models with pharmaceutical needs using multi-granularity preference pairs. DecompDPO introduces decomposition into the optimization objectives and obtains preference pairs at the molecule or decomposed substructure level based on each objective's decomposability. Additionally, DecompDPO introduces a physics-informed energy term to ensure reasonable molecular conformations in the optimization results. Notably, DecompDPO can be effectively used for two main purposes: (1) fine-tuning pretrained diffusion models for molecule generation across various protein families, and (2) molecular optimization given a specific protein subpocket after generation. Extensive experiments on the CrossDocked2020 benchmark show that DecompDPO significantly improves...
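For orientation, the generic DPO objective for a single preference pair can be sketched as below. This is the standard likelihood-based form of DPO, not DecompDPO's decomposed, diffusion-specific objective or its energy term; the function name, argument names, and the default `beta` are assumptions for illustration.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred, dispreferred) pair.

    logp_* are log-likelihoods under the model being tuned, ref_logp_* under
    the frozen reference model; `w` marks the preferred sample.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-likelihoods: the tuned model favours the preferred molecule.
good = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
bad = dpo_loss(logp_w=-3.0, logp_l=-1.0, ref_logp_w=-2.0, ref_logp_l=-2.0)
print(good, bad)
```

The loss falls as the tuned model raises the preferred sample's likelihood relative to the reference more than it raises the dispreferred one's, which is the behaviour any preference pair, whole-molecule or substructure-level, is meant to induce.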
Modeling drop deformations and rheology of dilute to dense emulsions.
EN: We highlight the current state-of-the-art in modeling emulsion rheology, ranging from dilute to jammed dense systems. We focus on analytical and numerical methods developed for calculating, computing, and tracking drop deformation en route to developing constitutive models for flowing emulsions. We identify material properties and dimensionless parameters, collate the small deformation theories and resulting expressions for viscometric quantities, list theoretical and numerical methods, and take stock of challenges for capturing connections between drop deformation, morphology, and rheology of emulsions. We highlight the substantial progress in providing quantitative descriptions of the rheological response using analytical theories, dimensional analysis, and powerful computational fluid dynamics to determine how macroscopic rheological properties emerge from microscopic features, including deformation and dynamics of non-interacting or interacting drops and molecular aspects that control the interfacial properties.
Accelerating Drug Safety Assessment using Bidirectional-LSTM for SMILES Data.
EN: Computational methods are useful in accelerating the pace of drug discovery. Drug discovery involves several steps, such as target identification and validation, lead discovery, and lead optimisation. In the lead optimisation phase, the absorption, distribution, metabolism, excretion, and toxicity properties of lead compounds are assessed. This work addresses the prediction of toxicity and solubility of lead compounds represented in Simplified Molecular Input Line Entry System (SMILES) notation. Among the different approaches that work on SMILES data, the proposed model was built using a sequence-based approach. The proposed Bi-Directional Long Short Term Memory (BiLSTM) is a variant of Recurrent Neural Network (RNN) that processes input molecular sequences for the comprehensive examination of the structural features of molecules from both forward and backward directions. The proposed work aims to understand the sequential patterns encoded in the SMILES strings, which are then utilised for predicting the toxicity of the molecules. The proposed model on the ClinTox dataset surpasses previous approaches such as Trimnet and Pre-training Graph Neural Networks (GNN) by achieving ...
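As an illustration of the sequence-based view, the sketch below tokenizes a SMILES string and produces the forward and reversed token sequences that the two directions of a bidirectional recurrent model would read. The regular expression and the helper names `tokenize` and `bidirectional_views` are assumptions for this example; the paper's actual preprocessing is not described in the abstract.

```python
import re

# Assumed tokenization rule: two-letter halogens, bracket atoms and two-digit
# ring closures stay whole; every other character is its own token.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|%\d{2}|.")

def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)

def bidirectional_views(smiles):
    """Return the token sequence as read forwards and backwards."""
    tokens = tokenize(smiles)
    return tokens, tokens[::-1]

forward, backward = bidirectional_views("CC(=O)Cl")  # acetyl chloride
print(forward)
print(backward)
```

In a full model, each token would be mapped to an embedding before being fed to the forward and backward recurrent layers.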
Multi-Label Plant Species Classification with Self-Supervised Vision Transformers.
EN: We present a transfer learning approach using a self-supervised Vision Transformer (DINOv2) for the PlantCLEF 2024 competition, focusing on the multi-label plant species classification. Our method leverages both base and fine-tuned DINOv2 models to extract generalized feature embeddings. We train classifiers to predict multiple plant species within a single image using these rich embeddings. To address the computational challenges of the large-scale dataset, we employ Spark for distributed data processing, ensuring efficient memory management and processing across a cluster of workers. Our data processing pipeline transforms images into grids of tiles, classifying each tile, and aggregating these predictions into a consolidated set of probabilities. Our results demonstrate the efficacy of combining transfer learning with advanced data processing techniques for multi-label image classification tasks. Our code is available at https://github.com/dsgt-kaggle-clef/plantclef-2024.
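The tile-based pipeline described above can be sketched as follows. This is a toy version using nested lists instead of real images; the mean-pooling rule in `aggregate` and the stand-in `toy_classifier` are assumptions, since the abstract does not say how tile predictions are combined or which classifier head is used.

```python
def split_into_tiles(image, rows, cols):
    """Split a 2D list (height x width) into rows*cols tiles; edge remainders are dropped."""
    height, width = len(image), len(image[0])
    tile_h, tile_w = height // rows, width // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            band = image[r * tile_h:(r + 1) * tile_h]
            tiles.append([row[c * tile_w:(c + 1) * tile_w] for row in band])
    return tiles

def aggregate(tile_probs):
    """Average per-species probabilities over all tiles (assumed pooling rule)."""
    species = set().union(*tile_probs)
    n = len(tile_probs)
    return {s: sum(p.get(s, 0.0) for p in tile_probs) / n for s in sorted(species)}

def toy_classifier(tile):
    """Hypothetical stand-in for a per-tile classifier: scores by mean intensity."""
    mean = sum(sum(row) for row in tile) / (len(tile) * len(tile[0]))
    return {"species_a": mean, "species_b": 1.0 - mean}

def classify_image(image, rows, cols, classifier):
    """Classify each tile, then merge the tile predictions into one set of probabilities."""
    return aggregate([classifier(t) for t in split_into_tiles(image, rows, cols)])

image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
print(classify_image(image, 2, 2, toy_classifier))
```

Taking the maximum over tiles instead of the mean would favour species that appear in only a small part of the image; which rule works better is an empirical question.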
Assessing Cardiomegaly in Dogs Using a Simple CNN Model.
EN: This paper introduces DogHeart, a dataset comprising 1400 training, 200 validation, and 400 test images categorized as small, normal, and large based on VHS score. A custom CNN model is developed, featuring a straightforward architecture with 4 convolutional layers and 4 fully connected layers. Despite the absence of data augmentation, the model achieves a 72% accuracy in classifying cardiomegaly severity. The study contributes to automated assessment of cardiac conditions in dogs, highlighting the potential for early detection and intervention in veterinary care.
Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters.
EN: Three-dimensional (3D) deep molecular generative models offer the advantage of goal-directed generation based on 3D-dependent properties, such as binding affinity for structure-based design within binding pockets. Traditional benchmarks created to evaluate SMILES or molecular graph generators, such as GuacaMol or MOSES, are of limited use for evaluating 3D generators, as they do not assess the quality of the generated molecular conformation. In this work, we hence developed GenBench3D, which implements a new benchmark for models producing molecules within a binding pocket. Our main contribution is the Validity3D metric, evaluating the conformation quality using the likelihood of bond lengths and valence angles based on reference values observed in the Cambridge Structural Database. The LiGAN, 3D-SBDD, Pocket2Mol, TargetDiff, DiffSBDD and ResGen models were benchmarked. We show that only between 0% and 11% of generated molecules have valid conformations. Performing local relaxation of generated molecules in the pocket considerably improved the Validity3D for all models by a minimum increase of 40%. For LiGAN, 3D-SBDD, or TargetDiff, the set of valid relaxed molecules shows on average higher V...
Unraveling Molecular Structure: A Multimodal Spectroscopic Dataset for Chemistry.
EN: Spectroscopic techniques are essential tools for determining the structure of molecules. Different spectroscopic techniques, such as Nuclear magnetic resonance (NMR), Infrared spectroscopy, and Mass Spectrometry, provide insight into the molecular structure, including the presence or absence of functional groups. Chemists leverage the complementary nature of the different methods to their advantage. However, the lack of a comprehensive multimodal dataset, containing spectra from a variety of spectroscopic techniques, has limited machine-learning approaches mostly to single-modality tasks for predicting molecular structures from spectra. Here we introduce a dataset comprising simulated $^1$H-NMR, $^{13}$C-NMR, HSQC-NMR, Infrared, and Mass spectra (positive and negative ion modes) for 790k molecules extracted from chemical reactions in patent data. This dataset enables the development of foundation models for integrating information from multiple spectroscopic modalities, emulating the approach employed by human experts. Additionally, we provide benchmarks for evaluating single-modality tasks such as structure elucidation, predicting the spectra for a target molecule, and functional ...
Molecular Diffusion Models with Virtual Receptors.
EN: Machine learning approaches to Structure-Based Drug Design (SBDD) have proven quite fertile over the last few years. In particular, diffusion-based approaches to SBDD have shown great promise. We present a technique which expands on this diffusion approach in two crucial ways. First, we address the size disparity between the drug molecule and the target/receptor, which makes learning more challenging and inference slower. We do so through the notion of a Virtual Receptor, which is a compressed version of the receptor; it is learned so as to preserve key aspects of the structural information of the original receptor, while respecting the relevant group equivariance. Second, we incorporate a protein language embedding used originally in the context of protein folding. We experimentally demonstrate the contributions of both the virtual receptors and the protein embeddings: in practice, they lead to both better performance, as well as significantly faster computations.
General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design.
EN: Structure-Based Drug Design (SBDD) focuses on generating valid ligands that strongly and specifically bind to a designated protein pocket. Several methods use machine learning for SBDD to generate these ligands in 3D space, conditioned on the structure of a desired protein pocket. Recently, diffusion models have shown success here by modeling the underlying distributions of atomic positions and types. While these methods are effective in considering the structural details of the protein pocket, they often fail to explicitly consider the binding affinity. Binding affinity characterizes how tightly the ligand binds to the protein pocket, and is measured by the change in free energy associated with the binding process. It is one of the most crucial metrics for benchmarking the effectiveness of the interaction between a ligand and protein pocket. To address this, we propose BADGER: Binding Affinity Diffusion Guidance with Enhanced Refinement. BADGER is a general guidance method to steer the diffusion sampling process towards improved protein-ligand binding, allowing us to adjust the distribution of the binding affinity between ligands and proteins. Our method is enabled by using a neur...
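The link between binding affinity and free energy mentioned above can be illustrated with the textbook relation ΔG° = RT ln(Kd / 1 M). The sketch below is not part of BADGER; the function name `binding_free_energy` and the default temperature are assumptions for this example.

```python
import math

R_KCAL = 1.987204259e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Standard binding free energy in kcal/mol from a dissociation constant in mol/L.

    Uses dG = R * T * ln(Kd / c0) with the standard concentration c0 = 1 mol/L.
    More negative values mean tighter binding.
    """
    if kd_molar <= 0:
        raise ValueError("dissociation constant must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)

# A nanomolar binder versus a micromolar binder at room temperature.
print(round(binding_free_energy(1e-9), 2))
print(round(binding_free_energy(1e-6), 2))
```

Each tenfold improvement in Kd lowers the free energy by roughly 1.36 kcal/mol at 298 K, which is why docking scores and guidance terms are often discussed on this scale.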
$\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials.
EN: Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on nablaDFT. It contains twice as many molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($ω$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for tra...
Fish Tracking, Counting, and Behaviour Analysis in Digital Aquaculture: A Comprehensive Survey.
EN: Digital aquaculture leverages advanced technologies and data-driven methods, providing substantial benefits over traditional aquaculture practices. This paper presents a comprehensive review of three interconnected digital aquaculture tasks, namely, fish tracking, counting, and behaviour analysis, using a novel and unified approach. Unlike previous reviews which focused on single modalities or individual tasks, we analyse vision-based (i.e. image- and video-based), acoustic-based, and biosensor-based methods across all three tasks. We examine their advantages, limitations, and applications, highlighting recent advancements and identifying critical cross-cutting research gaps. The review also includes emerging ideas such as applying multi-task learning and large language models to address various aspects of fish monitoring, an approach not previously explored in aquaculture literature. We identify the major obstacles hindering research progress in this field, including the scarcity of comprehensive fish datasets and the lack of unified evaluation standards. To overcome the current limitations, we explore the potential of using emerging technologies such as multimodal data fusion and...
Low temperature formation of pyridine and (iso)quinoline via neutral neutral reactions.
EN: Aromatic molecules represent fundamental building blocks in prebiotic chemistry and are regarded as vital precursors to DNA and RNA nitrogen bases. However, despite the identification of some 300 molecules in extraterrestrial environments, the pathways to pyridine (C5H5N), pyridinyl (C5H4N), and (iso)quinoline (C9H7N), the simplest representatives of mono- and bicyclic aromatic molecules carrying nitrogen, remain elusive. Here, we provide compelling evidence for the gas-phase formation of methylene amidogen (H2CN) and cyanomethyl (H2CCN) radicals via molecular beam studies and electronic structure calculations. The modeling of the chemistries of the Taurus Molecular Cloud (TMC-1) and Titan's atmosphere points to a complex chain of reactions synthesizing pyridine, pyridinyl, and (iso)quinoline from H2CN and H2CCN at levels of up to 75%. This study provides unique entry points to precursors of DNA and RNA nitrogen bases in hydrocarbon-rich extraterrestrial environments, thus changing the way we think about the origin of prebiotic molecules in our Galaxy.
PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes.
EN: Multimodal Large Language Models (MLLMs) have seen growing adoption across various scientific disciplines. These advancements encourage the investigation of molecule-text modeling within synthetic chemistry, a field dedicated to designing and conducting chemical reactions to synthesize new compounds with desired properties and applications. Current approaches, however, often neglect the critical role of multiple molecule graph interaction in understanding chemical reactions, leading to suboptimal performance in synthetic chemistry tasks. This study introduces PRESTO (Progressive Pretraining Enhances Synthetic Chemistry Outcomes), a new framework that bridges the molecule-text modality gap by integrating a comprehensive benchmark of pretraining strategies and dataset configurations. It progressively improves multimodal LLMs through cross-modal alignment and multi-graph understanding. Our extensive experiments demonstrate that PRESTO offers competitive results in downstream synthetic chemistry tasks. The code can be found at https://github.com/IDEA-XL/PRESTO.
Skin Cancer Images Classification using Transfer Learning Techniques.
EN: Skin cancer is one of the most common and deadliest types of cancer. Early diagnosis of skin cancer at a benign stage is critical to reducing cancer mortality. Detecting skin cancer at an earlier stage requires an automated system, which could save the lives of many patients. Many previous studies have addressed the problem of skin cancer diagnosis using various deep learning and transfer learning models. However, existing approaches are limited in accuracy and rely on time-consuming procedures. In this work, we applied five different pre-trained transfer learning approaches to the binary classification of skin cancer at benign and malignant stages. To increase the accuracy of these models, we fine-tune different layers and activation functions. We used a publicly available ISIC dataset to evaluate the transfer learning approaches. For model stability, data augmentation techniques are applied to improve the randomness of the input dataset. These approaches are evaluated using different hyperparameters such as batch sizes, epochs, and optimizers. The experimental results show that the ResNet-50 model provides an accuracy of 0.935, F1-score of 0.86, and precision of 0.94.
Geometric-informed GFlowNets for Structure-Based Drug Design.
EN: The rising cost of drug discovery and the current pace at which drugs are discovered underscore the need for more efficient structure-based drug design (SBDD) methods. We employ Generative Flow Networks (GFlowNets) to effectively explore the vast combinatorial space of drug-like molecules, which traditional virtual screening methods fail to cover. We introduce a novel modification to the GFlowNet framework by incorporating trigonometrically consistent embeddings, previously utilized in tasks involving protein conformation and protein-ligand interactions, to enhance the model's ability to generate molecules tailored to specific protein pockets. We have modified the existing protein conditioning used by GFlowNets, blending geometric information from both protein and ligand embeddings to achieve more geometrically consistent embeddings. Experiments conducted using CrossDocked2020 demonstrated an improvement in the binding affinity between generated molecules and protein pockets for both single and multi-objective tasks, compared to previous work. Additionally, we propose future work aimed at further increasing the geometric information captured in protein-ligand interactions.
CBGBench: Fill in the Blank of Protein-Molecule Complex Binding Graph.
EN: Structure-based drug design (SBDD) aims to generate potential drugs that can bind to a target protein and is greatly expedited by the aid of AI techniques in generative models. However, a lack of systematic understanding persists due to the diverse settings, complex implementation, difficult reproducibility, and task singularity. Firstly, the absence of standardization can lead to unfair comparisons and inconclusive insights. To address this dilemma, we propose CBGBench, a comprehensive benchmark for SBDD, that unifies the task as a generative heterogeneous graph completion, analogous to fill-in-the-blank of the 3D complex binding graph. By categorizing existing methods based on their attributes, CBGBench facilitates a modular and extensible framework that implements various cutting-edge methods. Secondly, a single task on de novo molecule generation can hardly reflect their capabilities. To broaden the scope, we have adapted these models to a range of tasks essential in drug design, which are considered sub-tasks within the graph fill-in-the-blank tasks. These tasks include the generative designation of de novo molecules, linkers, fragments, scaffolds, and sidech...
Development and Validation of a Machine Learning Algorithm for Clinical Wellness Visit Classification in Cats and Dogs.
EN: Early disease detection in veterinary care relies on identifying subclinical abnormalities in asymptomatic animals during wellness visits. This study introduces an algorithm designed to distinguish between wellness and other veterinary visits. The purpose of this study is to validate the use of a visit classification algorithm compared to manual classification of veterinary visits by three board-certified veterinarians. Using a dataset of 11,105 clinical visits from 2012 to 2017 involving 655 animals (85.3% canines and 14.7% felines) across 544 U.S. veterinary establishments, the model was trained using a Gradient Boosting Machine model. Three validators were tasked with classifying 400 visits, including both wellness and other types of visits, selected randomly from the same database used for initial algorithm training, aiming to maintain consistency and relevance between the training and application phases; visit classifications were subsequently categorized into "wellness" or "other" based on majority consensus among validators to assess the algorithm's performance in identifying wellness visits. The algorithm demonstrated a specificity of 0.94 (95% CI: 0.91 to 0.96), implying it...
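For readers unfamiliar with how a specificity and its confidence interval are obtained, here is a minimal sketch. The counts in the example are hypothetical and are not the study's data, and the Wilson score interval is an assumption, since the abstract does not state which interval method the authors used.

```python
import math

def specificity(true_neg, false_pos):
    """Fraction of non-wellness ("other") visits correctly classified as such."""
    return true_neg / (true_neg + false_pos)

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical confusion counts, chosen only to illustrate the calculation.
true_neg, false_pos = 188, 12
low, high = wilson_interval(true_neg, true_neg + false_pos)
print(specificity(true_neg, false_pos), round(low, 3), round(high, 3))
```

The interval narrows as the number of validated visits grows, which is why the size of the manually classified sample matters for the reported range.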
From Theory to Therapy: Reframing SBDD Model Evaluation via Practical Metrics.
EN: Recent advancements in structure-based drug design (SBDD) have significantly enhanced the efficiency and precision of drug discovery by generating molecules tailored to bind specific protein pockets. Despite these technological strides, their practical application in real-world drug development remains challenging due to the complexities of synthesizing and testing these molecules. The reliability of the Vina docking score, the current standard for assessing binding abilities, is increasingly questioned due to its susceptibility to overfitting. To address these limitations, we propose a comprehensive evaluation framework that includes assessing the similarity of generated molecules to known active compounds, introducing a virtual screening-based metric for practical deployment capabilities, and re-evaluating binding affinity more rigorously. Our experiments reveal that while current SBDD models achieve high Vina scores, they fall short in practical usability metrics, highlighting a significant gap between theoretical predictions and real-world applicability. Our proposed metrics and dataset aim to bridge this gap, enhancing the practical applicability of future SBDD models and alig...
MoleculeCLA: Rethinking Molecular Benchmark via Computational Ligand-Target Binding Analysis.
EN: Molecular representation learning is pivotal for various molecular property prediction tasks related to drug discovery. Robust and accurate benchmarks are essential for refining and validating current methods. Existing molecular property benchmarks derived from wet experiments, however, face limitations such as data volume constraints, unbalanced label distribution, and noisy labels. To address these issues, we construct a large-scale and precise molecular representation dataset of approximately 140,000 small molecules, meticulously designed to capture an extensive array of chemical, physical, and biological properties, derived through a robust computational ligand-target binding analysis pipeline. We conduct extensive experiments on various deep learning models, demonstrating that our dataset offers significant physicochemical interpretability to guide model development and design. Notably, the dataset's properties are linked to binding affinity metrics, providing additional insights into model performance in drug-target interaction tasks. We believe this dataset will serve as a more accurate and reliable benchmark for molecular representation learning, thereby expediting progress...
Scientific Computing with Large Language Models.
EN: We provide an overview of the emergence of large language models for scientific computing applications. We highlight use cases that involve natural language processing of scientific documents and specialized languages designed to describe physical systems. For the former, chatbot style applications appear in medicine, mathematics and physics and can be used iteratively with domain experts for problem solving. We also review specialized languages within molecular biology, the languages of molecules, proteins, and DNA where language models are being used to predict properties and even create novel physical systems at much faster rates than traditional computing methods.
CompassDock: Comprehensive Accurate Assessment Approach for Deep Learning-Based Molecular Docking in Inference and Fine-Tuning.
EN: Datasets used for molecular docking, such as PDBBind, contain technical variability - they are noisy. Although the origins of the noise have been discussed, a comprehensive analysis of the physical, chemical, and bioactivity characteristics of the datasets is still lacking. To address this gap, we introduce the Comprehensive Accurate Assessment (Compass). Compass integrates two key components: PoseCheck, which examines ligand strain energy, protein-ligand steric clashes, and interactions, and AA-Score, a new empirical scoring function for calculating binding affinity energy. Together, these form a unified workflow that assesses both the physical/chemical properties and bioactivity favorability of ligands and protein-ligand interactions. Our analysis of the PDBBind dataset using Compass reveals substantial noise in the ground truth data. Additionally, we propose CompassDock, which incorporates the Compass module with DiffDock, the state-of-the-art deep learning-based molecular docking method, to enable accurate assessment of docked ligands during inference. Finally, we present a new paradigm for enhancing molecular docking model performance by fine-tuning with Compass Scores, which ...
Smiles2Dock: an open large-scale multi-task dataset for ML-based molecular docking.
EN: Docking is a crucial component in drug discovery aimed at predicting the binding conformation and affinity between small molecules and target proteins. ML-based docking has recently emerged as a prominent approach, outpacing traditional methods like DOCK and AutoDock Vina in handling the growing scale and complexity of molecular libraries. However, the availability of comprehensive and user-friendly datasets for training and benchmarking ML-based docking algorithms remains limited. We introduce Smiles2Dock, an open large-scale multi-task dataset for molecular docking. We created a framework combining P2Rank and AutoDock Vina to dock 1.7 million ligands from the ChEMBL database against 15 AlphaFold proteins, giving us more than 25 million protein-ligand binding scores. The dataset leverages a wide range of high-accuracy AlphaFold protein models, encompasses a diverse set of biologically relevant compounds and enables researchers to benchmark all major approaches for ML-based docking such as Graph, Transformer and CNN-based methods. We also introduce a novel Transformer-based architecture for docking scores prediction and set it as an initial benchmark for our dataset. Our dataset an...
Multifidelity digital twin for real-time monitoring of structural dynamics in aquaculture net cages.
EN: As the global population grows and climate change intensifies, sustainable food production is critical. Marine aquaculture offers a viable solution, providing a sustainable protein source. However, the industry's expansion requires novel technologies for remote management and autonomous operations. Digital twin technology can advance the aquaculture industry, but its adoption has been limited. Fish net cages, which are flexible floating structures, are critical yet vulnerable components of aquaculture farms. Exposed to harsh and dynamic marine environments, the cages experience significant loads and risk damage, leading to fish escapes, environmental impacts, and financial losses. We propose a multifidelity surrogate modeling framework for integration into a digital twin for real-time monitoring of aquaculture net cage structural dynamics under stochastic marine conditions. Central to this framework is the nonlinear autoregressive Gaussian process method, which learns complex, nonlinear cross-correlations between models of varying fidelity. It combines low-fidelity simulation data with a small set of high-fidelity field sensor measurements, which offer the real dynamics but are cos...
Cooperative learning of Pl@ntNet's Artificial Intelligence algorithm: how does it work and how can we improve it?
EN: Deep learning models for plant species identification rely on large annotated datasets. The PlantNet system enables global data collection by allowing users to upload and annotate plant observations, leading to noisy labels due to diverse user skills. Achieving consensus is crucial for training, but the vast scale of collected data makes traditional label aggregation strategies challenging. Existing methods either retain all observations, resulting in noisy training data, or selectively keep those with sufficient votes, discarding valuable information. Additionally, as many species are rarely observed, user expertise cannot be evaluated as an inter-user agreement: otherwise, botanical experts would have a lower weight in the AI training step than the average user. Our proposed label aggregation strategy aims to cooperatively train plant identification AI models. This strategy estimates user expertise as a trust score per user based on their ability to identify plant species from crowdsourced data. The trust score is recursively estimated from correctly identified species given the current estimated labels. This interpretable score exploits botanical experts' knowledge and the heter...
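The recursive trust estimate described in this abstract can be illustrated with a toy sketch. It is not the Pl@ntNet implementation: the function name `aggregate_labels`, the weight `1 + number of distinct species identified in agreement with the current labels`, the fixed number of rounds, and the example users and votes are all assumptions made for illustration.

```python
from collections import defaultdict

def aggregate_labels(votes, n_rounds=5):
    """votes: list of (user, observation, label). Returns (labels, trust)."""
    users = {user for user, _, _ in votes}
    trust = {user: 1.0 for user in users}  # assumption: everyone starts equal
    labels = {}
    for _ in range(n_rounds):
        # Step 1: trust-weighted vote per observation.
        scores = defaultdict(lambda: defaultdict(float))
        for user, obs, label in votes:
            scores[obs][label] += trust[user]
        # sorted() makes tie-breaking deterministic (alphabetical).
        labels = {obs: max(sorted(by_label), key=by_label.get)
                  for obs, by_label in scores.items()}
        # Step 2: re-estimate trust from the distinct species each user
        # identified in agreement with the current labels.
        agreed = defaultdict(set)
        for user, obs, label in votes:
            if labels[obs] == label:
                agreed[user].add(label)
        trust = {user: 1.0 + len(agreed[user]) for user in users}
    return labels, trust

# Hypothetical data: "ana" and "dee" agree on four species; on observation o5
# "ana" is outvoted two to one by users with no other record.
votes = [
    ("ana", "o1", "ivy"), ("dee", "o1", "ivy"),
    ("ana", "o2", "oak"), ("dee", "o2", "oak"),
    ("ana", "o3", "elm"), ("dee", "o3", "elm"),
    ("ana", "o4", "ash"), ("dee", "o4", "ash"),
    ("ana", "o5", "yew"), ("bob", "o5", "fir"), ("cy", "o5", "fir"),
]
labels, trust = aggregate_labels(votes)
```

In this toy data an unweighted majority vote would label `o5` as `fir`; once trust is re-estimated, the user with the broader record of agreed identifications outweighs the two others.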
Structure-based Drug Design Benchmark: Do 3D Methods Really Dominate?
EN: Currently, the field of structure-based drug design is dominated by three main types of algorithms: search-based algorithms, deep generative models, and reinforcement learning. While existing works have typically focused on comparing models within a single algorithmic category, cross-algorithm comparisons remain scarce. In this paper, to fill the gap, we establish a benchmark to evaluate the performance of sixteen models across these different algorithmic foundations by assessing the pharmaceutical properties of the generated molecules and their docking affinities with specified target proteins. We highlight the unique advantages of each algorithmic approach and offer recommendations for the design of future SBDD models. We emphasize that 1D/2D ligand-centric drug design methods can be used in SBDD by treating the docking function as a black-box oracle, which is typically neglected. The empirical results show that 1D/2D methods achieve competitive performance compared with 3D-based methods that use the 3D structure of the target protein explicitly. Also, AutoGrow4, a 2D molecular graph-based genetic algorithm, dominates SBDD in terms of optimization ability. The relevant code is av...
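To make the "black-box oracle" idea concrete, here is a minimal sketch in which the search loop sees only a score returned by an oracle and never the protein structure. Everything in it is a toy assumption: `toy_oracle` is a hypothetical stand-in for a docking function (a real benchmark would call a docking program), candidates are plain strings rather than valid molecules, and the search is a simple hill climb, not AutoGrow4's genetic algorithm.

```python
import random

ALPHABET = "CNOF"
HIDDEN_TARGET = "CCNOCF"  # toy stand-in; not a real ligand or SMILES string

def toy_oracle(candidate):
    """Hypothetical stand-in for a docking score: lower is better."""
    return -sum(a == b for a, b in zip(candidate, HIDDEN_TARGET))

def mutate(candidate, rng):
    """Replace one randomly chosen position with a random symbol."""
    pos = rng.randrange(len(candidate))
    return candidate[:pos] + rng.choice(ALPHABET) + candidate[pos + 1:]

def black_box_search(oracle, start, n_steps=200, seed=0):
    """Hill climb that uses only oracle(candidate) values."""
    rng = random.Random(seed)
    best, best_score = start, oracle(start)
    for _ in range(n_steps):
        candidate = mutate(best, rng)
        score = oracle(candidate)  # the only information the search receives
        if score < best_score:     # keep strictly better candidates
            best, best_score = candidate, score
    return best, best_score

start = "FFFFFF"
start_score = toy_oracle(start)
best, best_score = black_box_search(toy_oracle, start)
```

Because the loop depends only on `oracle(candidate)`, any 1D/2D optimizer could be swapped in without changing how the score is produced.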
QCDGE database, Quantum Chemistry Database with Ground- and Excited-state Properties of 450 Kilo Molecules.
EN: Due to rapid advancements in deep learning techniques, the demand for large-volume, high-quality databases has grown significantly in chemical research. We developed a quantum-chemistry database that includes 443,106 small organic molecules with sizes up to 10 heavy atoms including carbon (C), nitrogen (N), oxygen (O), and fluorine (F). Ground-state geometry optimizations and frequency calculations of all compounds were performed at the B3LYP/6-31G level with the BJD3 dispersion correction, while the excited-state single-point calculations were conducted at the ωB97X-D/6-31G level. In total, twenty-seven molecular properties, such as geometric, thermodynamic, electronic and energetic properties, were gathered from these calculations. Meanwhile, we also established a comprehensive protocol for the construction of a high-volume quantum-chemistry database. Our QCDGE (Quantum Chemistry Database with Ground- and Excited-State Properties) database contains a substantial volume of data, exhibits high chemical diversity, and most importantly includes excited-state information. This database, along with its construction protocol, is expected to have a significant impact on the broad applicatio...
Isotopologue-selective laser cooling of molecules.
EN: Direct laser cooling of molecules has made significant progress in recent years. However, the selective cooling and manipulation of molecules based on their isotopic composition, which is ubiquitous in atomic laser cooling, has not yet been achieved. Here, we demonstrate such isotopologue-selective laser cooling of molecules, using barium monofluoride (BaF) as an example. The manipulation of the rare and previously uncooled 136BaF is achieved within a molecular beam containing several isotopologues of significantly higher natural abundance. Our results enable intense molecular beams and high fidelity detection of select low-abundance isotopologues or isotopic mixtures. Such beams are a first step towards isotopologue-selective molecular trapping and will be useful for applications in trace gas analysis, cold chemistry and precision tests of fundamental symmetries.
Automatic Fused Multimodal Deep Learning for Plant Identification.
EN: Plant classification is vital for ecological conservation and agricultural productivity, enhancing our understanding of plant growth dynamics and aiding species preservation. The advent of deep learning (DL) techniques has revolutionized this field by enabling autonomous feature extraction, significantly reducing the dependence on manual expertise. However, conventional DL models often rely solely on single data sources, failing to capture the full biological diversity of plant species comprehensively. Recent research has turned to multimodal learning to overcome this limitation by integrating multiple data types, which enriches the representation of plant characteristics. This shift introduces the challenge of determining the optimal point for modality fusion. In this paper, we introduce a pioneering multimodal DL-based approach for plant classification with automatic modality fusion. Utilizing the multimodal fusion architecture search, our method integrates images from multiple plant organs -- flowers, leaves, fruits, and stems -- into a cohesive model. To address the lack of multimodal datasets, we contributed Multimodal-PlantCLEF, a restructured version of the PlantCLEF2015 dat...
TAGMol: Target-Aware Gradient-guided Molecule Generation.
EN: 3D generative models have shown significant promise in structure-based drug design (SBDD), particularly in discovering ligands tailored to specific target binding sites. Existing algorithms often focus primarily on ligand-target binding, characterized by binding affinity. Moreover, models trained solely on target-ligand distribution may fall short in addressing the broader objectives of drug discovery, such as the development of novel ligands with desired properties like drug-likeness and synthesizability, underscoring the multifaceted nature of the drug design process. To overcome these challenges, we decouple the problem into molecular generation and property prediction. The latter synergistically guides the diffusion sampling process, facilitating guided diffusion and resulting in the creation of meaningful molecules with the desired properties. We call this guided molecular generation process TAGMol. Through experiments on benchmark datasets, TAGMol demonstrates superior performance compared to state-of-the-art baselines, achieving a 22% improvement in average Vina Score and yielding favorable outcomes in essential auxiliary properties. This establishes TAGMol as a comprehe...
Robust Biharmonic Skinning Using Geometric Fields.
EN: Skinning is a popular way to rig and deform characters for animation, to compute reduced-order simulations, and to define features for geometry processing. Methods built on skinning rely on weight functions that distribute the influence of each degree of freedom across the mesh. Automatic skinning methods generate these weight functions with minimal user input, usually by solving a variational problem on a mesh whose boundary is the skinned surface. This formulation necessitates tetrahedralizing the volume bounded by the surface, which brings with it meshing artifacts, the possibility of tetrahedralization failure, and the impossibility of generating weights for surfaces that are not closed. We introduce a mesh-free and robust automatic skinning method that generates high-quality skinning weights comparable to the current state of the art without volumetric meshes. Our method reliably works even on open surfaces and triangle soups where current methods fail. We achieve this through the use of a Lagrangian representation for skinning weights, which circumvents the need for finite elements while optimizing the biharmonic energy.
Beyond Conventional Parametric Modeling: Data-Driven Framework for Estimation and Prediction of Time Activity Curves in Dynamic PET Imaging.
EN: Dynamic Positron Emission Tomography (dPET) imaging and Time-Activity Curve (TAC) analyses are essential for understanding and quantifying the biodistribution of radiopharmaceuticals over time and space. Traditional compartmental modeling, while foundational, commonly struggles to fully capture the complexities of biological systems, including non-linear dynamics and variability. This study introduces an innovative data-driven neural network-based framework, inspired by Reaction Diffusion systems, designed to address these limitations. Our approach, which adaptively fits TACs from dPET, enables the direct calibration of diffusion coefficients and reaction terms from observed data, offering significant improvements in predictive accuracy and robustness over traditional methods, especially in complex biological scenarios. By more accurately modeling the spatio-temporal dynamics of radiopharmaceuticals, our method advances modeling of pharmacokinetic and pharmacodynamic processes, enabling new possibilities in quantitative nuclear medicine.
BInD: Bond and Interaction-generating Diffusion Model for Multi-objective Structure-based Drug Design.
EN: A remarkable advance in geometric deep generative models with accumulated structural data enables structure-based drug design (SBDD) with target protein information only. However, most existing models struggle to address multi-objectives simultaneously while performing well only in their specialized tasks. Here, we present BInD, a diffusion model with knowledge-based guidance for multi-objective SBDD. BInD is designed to co-generate molecules and their interactions with a target protein to consider all key objectives equally well, including target-specific interactions, molecular properties, and local geometry. Comprehensive evaluations show that BInD achieves robust performance for all objectives while outperforming or matching state-of-the-art methods for each. Finally, we propose a train-free optimization method empowered by retrieving target-specific interactions, highlighting the role of non-covalent interactions in achieving higher selectivity and binding affinities to a target protein.
Assessing the potential of deep learning for protein-ligand docking.
EN: The effects of ligand binding on protein structures and their in vivo functions carry numerous implications for modern biomedical research and biotechnology development efforts such as drug discovery. Although several deep learning (DL) methods and benchmarks designed for protein-ligand docking have recently been introduced, to date no prior works have systematically studied the behavior of the latest docking and structure prediction methods within the broadly applicable context of (1) using predicted (apo) protein structures for docking (e.g., for applicability to new proteins); (2) binding multiple (cofactor) ligands concurrently to a given target protein (e.g., for enzyme design); and (3) having no prior knowledge of binding pockets (e.g., for generalization to unknown pockets). To enable a deeper understanding of docking methods' real-world utility, we introduce PoseBench, the first comprehensive benchmark for broadly applicable protein-ligand docking. PoseBench enables researchers to rigorously and systematically evaluate DL methods for apo-to-holo protein-ligand docking and protein-ligand structure prediction using both primary ligand and multi-ligand benchmark datasets, the ...
Animal Behavior Analysis Methods Using Deep Learning: A Survey.
EN: Animal behavior serves as a reliable indicator of the adaptation of organisms to their environment and their overall well-being. Through rigorous observation of animal actions and interactions, researchers and observers can glean valuable insights into diverse facets of their lives, encompassing health, social dynamics, ecological relationships, and neuroethological dimensions. Although state-of-the-art deep learning models have demonstrated remarkable accuracy in classifying various forms of animal data, their adoption in animal behavior studies remains limited. This survey article endeavors to comprehensively explore deep learning architectures and strategies applied to the identification of animal behavior, spanning auditory, visual, and audiovisual methodologies. Furthermore, the manuscript scrutinizes extant animal behavior datasets, offering a detailed examination of the principal challenges confronting this research domain. The article culminates in a comprehensive discussion of key research directions within deep learning that hold potential for advancing the field of animal behavior studies.
More than just smoke and mirrors: Gas-phase polaritons for optical control of chemistry.
EN: Gas-phase molecules are a promising platform through which to elucidate the mechanisms of action and scope of polaritons for optical control of chemistry. Polaritons arise from the strong coupling of a dipole-allowed molecular transition with the photonic mode of an optical cavity. There is mounting evidence of modified reactivity under polaritonic conditions; however, the complex condensed-phase environment of most experimental demonstrations impedes mechanistic understanding of this phenomenon. While the gas phase was the playground of early efforts in atomic cavity quantum electrodynamics, we have only recently demonstrated the formation of molecular polaritons under these conditions. Studying the reactivity of isolated gas-phase molecules under strong coupling would eliminate solvent interactions and enable quantum state resolution of reaction progress. In this Perspective, we contextualize recent gas-phase efforts in the field of polariton chemistry and offer a practical guide for experiment design moving forward.
Guided Multi-objective Generative AI to Enhance Structure-based Drug Design.
EN: Generative AI has the potential to revolutionize drug discovery. Yet, despite recent advances in deep learning, existing models cannot generate molecules that satisfy all desired physicochemical properties. Herein, we describe IDOLpro, a generative chemistry AI combining diffusion with multi-objective optimization for structure-based drug design. Differentiable scoring functions guide the latent variables of the diffusion model to explore uncharted chemical space and generate novel ligands in silico, optimizing a plurality of target physicochemical properties. We demonstrate our platform's effectiveness by generating ligands with optimized binding affinity and synthetic accessibility on two benchmark sets. IDOLpro produces ligands with binding affinities over 10%-20% better than the next best state-of-the-art method on each test set, producing more drug-like molecules with generally better synthetic accessibility scores than other methods. We do a head-to-head comparison of IDOLpro against a classic virtual screen of a large database of drug-like molecules. We show that IDOLpro can generate molecules for a range of important disease-related targets with better binding affinity and ...
QComp: A QSAR-Based Data Completion Framework for Drug Discovery.
EN: In drug discovery, in vitro and in vivo experiments reveal biochemical activities related to the efficacy and toxicity of compounds. The experimental data accumulate into massive, ever-evolving, and sparse datasets. Quantitative Structure-Activity Relationship (QSAR) models, which predict biochemical activities using only the structural information of compounds, face challenges in integrating the evolving experimental data as studies progress. We develop QSAR-Complete (QComp), a data completion framework to address this issue. Based on pre-existing QSAR models, QComp utilizes the correlation inherent in experimental data to enhance prediction accuracy across various tasks. Moreover, QComp emerges as a promising tool for guiding the optimal sequence of experiments by quantifying the reduction in statistical uncertainty for specific endpoints, thereby aiding in rational decision-making throughout the drug discovery process.
Challenges and opportunities for digital twins in precision medicine: a complex systems perspective.
EN: The adoption of digital twins (DTs) in precision medicine is increasingly viable, propelled by extensive data collection and advancements in artificial intelligence (AI), alongside traditional biomedical methodologies. However, the reliance on black-box predictive models, which utilize large datasets, presents limitations that could impede the broader application of DTs in clinical settings. We argue that hypothesis-driven generative models, particularly multiscale modeling, are essential for boosting the clinical accuracy and relevance of DTs, thereby making a significant impact on healthcare innovation. This paper explores the transformative potential of DTs in healthcare, emphasizing their capability to simulate complex, interdependent biological processes across multiple scales. By integrating generative models with extensive datasets, we propose a scenario-based modeling approach that enables the exploration of diverse therapeutic strategies, thus supporting dynamic clinical decision-making. This method not only leverages advancements in data science and big data for improving disease treatment and prevention but also incorporates insights from complex systems and network scie...
Evaluation of In vitro anti-inflammatory activity and Insilico pharmacokinetics and molecular docking study of Horsfieldia iryaghedhi.
EN: Phytochemicals remain a valuable source for developing clinically important drugs to treat chronic and acute diseases. Inflammation is the body's response to an injurious stimulus, and novel therapeutic agents are needed to alleviate the condition with minimal side effects. Mature, fully expanded fresh leaves and bark of H. iryaghedhi were collected, and extracts were obtained by cold maceration using 99.9% methanol and distilled water as solvents. A concentration series was then prepared, and the anti-inflammatory activity was evaluated against diclofenac sodium as the positive control, using the heat-induced egg albumin denaturation method. Further, selected phytochemicals were tested against the COX-2 enzyme (PDB ID: 5IKR) using site-specific molecular docking with AutoDock Vina, and the binding energies and pharmacokinetic and toxicity parameters were evaluated. Results: The methanol and aqueous extracts showed moderate to strong concentration-dependent anti-inflammatory activity relative to the diclofenac sodium standard, and the methanol bark extract exhibited potent anti-inflammatory activity compared to the other extracts. Further, methanol and aqueous extracts sho...
Controlled molecule injector for cold, dense, and pure molecular beams at the European x-ray free-electron laser.
EN: A permanently available molecular-beam injection setup for controlled molecules (COMO) was installed and commissioned at the small quantum systems (SQS) instrument at the European x-ray free-electron laser (EuXFEL). A b-type electrostatic deflector allows for pure state-, size-, and isomer-selected samples of polar molecules and clusters. The source provides a rotationally cold (T ≈ 1 K) and dense (ρ ≈ 10^8 cm^-3) molecular beam with pulse durations up to 100 µs generated by a new version of the Even-Lavie valve. Here, a performance overview of the COMO setup is presented along with characterization experiments performed both with an optical laser at the Center for Free-Electron-Laser Science and with x-rays at EuXFEL under burst-mode operation. COMO was designed to be attached to different instruments at the EuXFEL, in particular at the small quantum systems (SQS) and single particles, clusters, and biomolecules (SPB) instruments. This advanced controlled-molecules injection setup enables XFEL studies using highly defined samples with soft and hard x-ray FEL radiation for applications ranging from atomic, molecular, and cluster physics to elementary processes i...
Perturbing Dynamics of Active Emulsions and Their Collectives.
EN: Controlling fluidic flows in active droplets is crucial in developing intelligent models to understand and mimic single-celled microorganisms. Typically, these fluidic flows are affected by the interfacial dynamics of chemical agents. We found that these flows can be reconfigured by the mere presence of an anisotropic solid boundary embedded within active droplets. Spontaneous fluidic flows dynamically orient an embedded magnetic cluster, and the magnetic cluster, when realigned, causes these flows to reorient, thus providing unprecedented control over the propulsion dynamics of chemotactic emulsions. When continuously perturbed, achiral emulsions exhibit emergent chiral motion with rotating fluidic flows. Such solid-fluid interactions remove the barriers of specific emulsion chemistries and complement their inherent abilities, thereby also enabling control over emergent collective behaviors of active droplets.
Predicting the binding of small molecules to proteins through invariant representation of the molecular structure.
EN: We present a computational scheme for predicting the ligands that bind to a pocket of known structure. It is based on the generation of a general abstract representation of the molecules, which is invariant to rotations, translations and permutations of atoms, and has some degree of isometry with the space of conformations. We use these representations to train a non-deep machine learning algorithm to classify the binding between pockets and molecule pairs, and show that this approach has a better generalization capability than existing methods.
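As a small illustration of what invariance to rotations, translations and atom permutations means in practice, the sketch below uses the sorted list of all pairwise interatomic distances. This is a textbook descriptor chosen for simplicity and is an assumption, not the representation proposed in the paper; it is also invariant to reflections and does not uniquely determine a structure. The function names and coordinates are hypothetical.

```python
import math
from itertools import combinations

def invariant_descriptor(coords):
    """Sorted pairwise distances between all atoms (3D coordinates)."""
    return sorted(math.dist(a, b) for a, b in combinations(coords, 2))

def same_descriptor(d1, d2, tol=1e-9):
    """Compare two descriptors element by element with a float tolerance."""
    return len(d1) == len(d2) and all(
        math.isclose(x, y, abs_tol=tol) for x, y in zip(d1, d2))

def rotate_z_and_shift(coords, angle, shift):
    """Rotate about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
            for x, y, z in coords]

# Hypothetical four-atom geometry (arbitrary units).
mol = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.5, 0.2), (0.3, 0.4, 2.0)]

# Same geometry after a rotation, a translation, and reversing the atom order.
moved = rotate_z_and_shift(mol, 0.7, (3.0, -2.0, 5.0))[::-1]

# A genuinely different geometry: stretched along x.
stretched = [(2.0 * x, y, z) for x, y, z in mol]

unchanged = same_descriptor(invariant_descriptor(mol), invariant_descriptor(moved))
changed = not same_descriptor(invariant_descriptor(mol), invariant_descriptor(stretched))
```

A classifier trained on such descriptors sees the same input regardless of how the molecule is posed or how its atoms are numbered, which is the property the abstract relies on.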
DrugLLM: Open Large Language Model for Few-shot Molecule Generation.
EN: Large Language Models (LLMs) have made great strides in areas such as language processing and computer vision. Despite the emergence of diverse techniques to improve few-shot learning capacity, current LLMs fall short in handling the languages of biology and chemistry. For example, they struggle to capture the relationship between molecule structure and pharmacochemical properties. Consequently, the few-shot learning capacity of small-molecule drug modification remains impeded. In this work, we introduce DrugLLM, an LLM tailored for drug design. During the training process, we employed Group-based Molecular Representation (GMR) to represent molecules, arranging them in sequences that reflect modifications aimed at enhancing specific molecular properties. DrugLLM learns how to modify molecules in drug discovery by predicting the next molecule based on past modifications. Extensive computational experiments demonstrate that DrugLLM can generate new molecules with expected properties based on limited examples, presenting a powerful few-shot molecule generation capacity.
A Survey of Few-Shot Learning for Biomedical Time Series.
EN: Advancements in wearable sensor technologies and the digitization of medical records have contributed to the unprecedented ubiquity of biomedical time series data. Data-driven models have tremendous potential to assist clinical diagnosis and improve patient care by improving long-term monitoring capabilities, facilitating early disease detection and intervention, as well as promoting personalized healthcare delivery. However, accessing extensively labeled datasets to train data-hungry deep learning models encounters many barriers, such as long-tail distribution of rare diseases, cost of annotation, privacy and security concerns, data-sharing regulations, and ethical considerations. An emerging approach to overcome the scarcity of labeled data is to augment AI methods with human-like capabilities to leverage past experiences to learn new tasks with limited examples, called few-shot learning. This survey provides a comprehensive review and comparison of few-shot learning methods for biomedical time series applications. The clinical benefits and limitations of such methods are discussed in relation to traditional data-driven approaches. This paper aims to provide insights into the cur...
Intermittent thermal convection in jammed emulsions.
EN: We study the process of thermal convection in jammed emulsions with a yield-stress rheology. We find that heat transfer occurs via an intermittent mechanism, whereby intense short-lived convective "heat bursts" are spaced out by long-lasting conductive periods. This behaviour is the result of a sequence of fluidization-rigidity transitions, rooted in a non-trivial interplay between emulsion yield-stress rheology and plastic activity, which we characterize via a statistical analysis of the dynamics at the droplet scale. We also show that droplets' coalescence induced during heat bursts leads to a spatially heterogeneous phase-inversion of the emulsion which eventually supports a sustained convective state.
BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers.
EN: Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks but remains challenging due to the lack of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers for enhancing biomedical retrieval via unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combination of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever's efficacy on various biomedical applications. BMRetriever also exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7 times larger, and the 2B variant matching the performance of models with over 5B parameters. The training data and model checkpoints are released at https://huggingface.co/BMRetriever to ensure transparency, reproducibility, and application to new domains.
MediFact at MEDIQA-M3G 2024: Medical Question Answering in Dermatology with Multimodal Learning.
EN: The MEDIQA-M3G 2024 challenge necessitates novel solutions for Multilingual & Multimodal Medical Answer Generation in dermatology (wai Yim et al., 2024a). This paper addresses the limitations of traditional methods by proposing a weakly supervised learning approach for open-ended medical question-answering (QA). Our system leverages readily available MEDIQA-M3G images via a VGG16-CNN-SVM model, enabling multilingual (English, Chinese, Spanish) learning of informative skin condition representations. Using pre-trained QA models, we further bridge the gap between visual and textual information through multimodal fusion. This approach tackles complex, open-ended questions even without predefined answer choices. We empower the generation of comprehensive answers by feeding the ViT-CLIP model with multiple responses alongside images. This work advances medical QA research, paving the way for clinical decision support systems and ultimately improving healthcare delivery.
Thermodynamic Origin of Water's Thermal Conductivity Maximum.
EN: The thermal conductivity of water features a maximum (TCM) as a function of temperature at constant pressure. By examining why molecular force fields succeed or fail to reproduce the maximum and interpreting our results using the Bridgman equation, we show that water's TCM is connected with its compressibility minimum. Using Stillinger-Weber potentials for tetrahedral liquids, we interpolate between the behaviour of simple liquids and highly tetrahedral materials such as carbon. Together with two vanishing limits at low/high tetrahedrality, we identify three regimes for the TCM: when it originates from either the compressibility minimum or density maximum, or both. Thus, the TCM exists in a "Goldilocks Zone" of tetrahedral order. We provide a thermodynamic explanation for the TCM of not only water, but tetrahedral liquids in general.
Machine Learning Applied to the Detection of Mycotoxin in Food: A Review.
EN: Mycotoxins, toxic secondary metabolites produced by certain fungi, pose significant threats to global food safety and public health. These compounds can contaminate a variety of crops, leading to economic losses and health risks to both humans and animals. Traditional lab analysis methods for mycotoxin detection can be time-consuming and may not always be suitable for large-scale screenings. However, in recent years, machine learning (ML) methods have gained popularity for use in the detection of mycotoxins and in the food safety industry in general, due to their accurate and timely predictions. We provide a systematic review on some of the recent ML applications for detecting/predicting the presence of mycotoxin on a variety of food ingredients, highlighting their advantages, challenges, and potential for future advancements. We address the need for reproducibility and transparency in ML research through open access to data and code. An observation from our findings is the frequent lack of detailed reporting on hyperparameters in many studies as well as a lack of open source code, which raises concerns about the reproducibility and optimisation of the ML models used. The findings ...
Molecular Docking via Weighted Subgraph Isomorphism on Quantum Annealers.
EN: Molecular docking is an essential step in the drug discovery process involving the detection of three-dimensional poses of a ligand inside the active site of the protein. In this paper, we address the Molecular Docking search phase by formulating the problem in QUBO terms, suitable for an annealing approach. We propose a problem formulation as a weighted subgraph isomorphism between the ligand graph and the grid of the target protein pocket. In particular, we applied a graph representation to the ligand that embeds all the geometrical properties of the molecule, including its flexibility, and we created a weighted spatial grid for the 3D region inside the pocket. Results and performance obtained with quantum annealers are compared with classical simulated annealing solvers.
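To illustrate what "formulating the problem in QUBO terms" can look like, here is a toy sketch and not the paper's formulation: two ligand atoms are placed on three grid sites along a line, with one-hot placement constraints, a no-overlap constraint, and a bond-length mismatch term all folded into a quadratic penalty. The site scores, the `PENALTY` and `EDGE` weights, and all names are assumptions made for illustration, and exhaustive enumeration stands in for an annealer.

```python
from itertools import product

atoms = [0, 1]          # two bonded ligand atoms
sites = [0, 1, 2]       # grid sites on a line, unit spacing
BOND_LENGTH = 1         # target distance between the two atoms (grid units)

# Hypothetical per-site scores for each atom (lower is better).
score = {(0, 0): -1.0, (0, 1): -0.2, (0, 2): -0.5,
         (1, 0): -0.3, (1, 1): -0.9, (1, 2): -0.4}
PENALTY = 4.0           # weight of the hard constraints
EDGE = 1.0              # weight of the bond-length mismatch

# One binary variable x[i, p] per (atom, site) pair.
variables = [(i, p) for i in atoms for p in sites]
index = {v: k for k, v in enumerate(variables)}

Q = {}                  # upper-triangular QUBO coefficients


def add(u, v, weight):
    """Accumulate `weight` on the coefficient of x_u * x_v."""
    key = (min(index[u], index[v]), max(index[u], index[v]))
    Q[key] = Q.get(key, 0.0) + weight


for i in atoms:
    # PENALTY * (sum_p x[i, p] - 1)^2 expands, using x^2 = x, to
    # -PENALTY per variable and +2 * PENALTY per pair (constant dropped).
    for p in sites:
        add((i, p), (i, p), score[(i, p)] - PENALTY)
    for p in sites:
        for q in sites:
            if p < q:
                add((i, p), (i, q), 2 * PENALTY)

for p in sites:
    add((0, p), (1, p), PENALTY)    # two atoms may not share a site
    for q in sites:
        if p != q:
            # Penalise placements whose distance differs from the bond length.
            add((0, p), (1, q), EDGE * abs(abs(p - q) - BOND_LENGTH))


def energy(bits):
    """QUBO energy of a 0/1 assignment (diagonal terms act as linear terms)."""
    return sum(w * bits[a] * bits[b] for (a, b), w in Q.items())


# Only 2^6 states, so brute force replaces the annealing step here.
best = min(product((0, 1), repeat=len(variables)), key=energy)
assignment = {i: p for (i, p) in variables if best[index[(i, p)]]}
```

On realistic pockets the variable count makes enumeration infeasible, which is where the quantum or simulated annealers compared in the abstract would take over.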
Detection and prebiotic chemistry of possible glycine precursor molecule methylenimine towards the hot molecular core G10.47+0.03.
EN: Amino acids are essential for the synthesis of protein. Amino acids contain both amine (R$-$NH$_{2}$) and carboxylic acid (R$-$COOH) functional groups, which help to understand the possible formation mechanism of life in the universe. Among the 20 types of amino acids, glycine (NH$_{2}$CH$_{2}$COOH) is known as the simplest non-essential amino acid. In the last 40 years, all surveys of NH$_{2}$CH$_{2}$COOH in the interstellar medium, especially in the star-formation regions, have failed at the millimeter and sub-millimeter wavelengths. We aimed to identify the possible precursors of NH$_{2}$CH$_{2}$COOH, because it is highly challenging to identify NH$_{2}$CH$_{2}$COOH in the interstellar medium. Many laboratory experiments have suggested that methylenimine (CH$_{2}$NH) plays a key role as a possible precursor of NH$_{2}$CH$_{2}$COOH in the star-formation regions via the Strecker synthesis reaction. After spectral analysis using the local thermodynamic equilibrium (LTE) model, we successfully identified the rotational emission lines of CH$_{2}$NH towards the hot molecular core G10.47+0.03 using the Atacama Compact Array (ACA). The estimated column density of CH$_{2}$NH towards G10....
MolCRAFT: Structure-Based Drug Design in Continuous Parameter Space.
EN: Generative models for structure-based drug design (SBDD) have shown promising results in recent years. Existing works mainly focus on how to generate molecules with higher binding affinity, ignoring the feasibility prerequisites for generated 3D poses and resulting in false positives. We conduct thorough studies on key factors of ill-conformational problems when applying autoregressive methods and diffusion to SBDD, including mode collapse and hybrid continuous-discrete space. In this paper, we introduce MolCRAFT, the first SBDD model that operates in the continuous parameter space, together with a novel noise reduced sampling strategy. Empirical results show that our model consistently achieves superior performance in binding affinity with more stable 3D structure, demonstrating our ability to accurately model interatomic interactions. To our best knowledge, MolCRAFT is the first to achieve reference-level Vina Scores (-6.59 kcal/mol) with comparable molecular size, outperforming other strong baselines by a wide margin (-0.84 kcal/mol). Code is available at https://github.com/AlgoMole/MolCRAFT.
GeoDirDock: Guiding Docking Along Geodesic Paths.
EN: This work introduces GeoDirDock (GDD), a novel approach to molecular docking that enhances the accuracy and physical plausibility of ligand docking predictions. GDD guides the denoising process of a diffusion model along geodesic paths within multiple spaces representing translational, rotational, and torsional degrees of freedom. Our method leverages expert knowledge to direct the generative modeling process, specifically targeting desired protein-ligand interaction regions. We demonstrate that GDD significantly outperforms existing blind docking methods in terms of RMSD accuracy and physicochemical pose realism. Our results indicate that incorporating domain expertise into the diffusion process leads to more biologically relevant docking predictions. Additionally, we explore the potential of GDD for lead optimization in drug discovery through angle transfer in maximal common substructure (MCS) docking, showcasing its capability to predict ligand orientations for chemically similar compounds accurately.
Does Biomedical Training Lead to Better Medical Performance?
EN: Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training in the context of six practical medical tasks evaluating 25 models. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.
AUTODIFF: Autoregressive Diffusion Modeling for Structure-based Drug Design.
EN: Structure-based drug design (SBDD), which aims to generate molecules that can bind tightly to the target protein, is an essential problem in drug discovery, and previous approaches have achieved initial success. However, most existing methods still suffer from invalid local structure or unrealistic conformation issues, which are mainly due to the poor learning of bond angles or torsional angles. To alleviate these problems, we propose AUTODIFF, a diffusion-based fragment-wise autoregressive generation model. Specifically, we design a novel molecule assembly strategy named conformal motif that preserves the conformation of local structures of molecules first, then we encode the interaction of the protein-ligand complex with an SE(3)-equivariant convolutional network and generate molecules motif-by-motif with diffusion modeling. In addition, we also improve the evaluation framework of SBDD by constraining the molecular weights of the generated molecules in the same range, together with some new metrics, which make the evaluation more fair and practical. Extensive experiments on CrossDocked2020 demonstrate that our approach outperforms the existing models in generating realistic molecu...
Universal Bovine Identification via Depth Data and Deep Metric Learning.
EN: This paper proposes and evaluates, for the first time, a top-down (dorsal view), depth-only deep learning system for accurately identifying individual cattle and provides associated code, datasets, and training weights for immediate reproducibility. An increase in herd size skews the cow-to-human ratio at the farm and makes the manual monitoring of individuals more challenging. Therefore, real-time cattle identification is essential for the farms and a crucial step towards precision livestock farming. Underpinned by our previous work, this paper introduces a deep-metric learning method for cattle identification using depth data from an off-the-shelf 3D camera. The method relies on CNN and MLP backbones that learn well-generalised embedding spaces from the body shape to differentiate individuals -- requiring neither species-specific coat patterns nor close-up muzzle prints for operation. The network embeddings are clustered using a simple algorithm such as k-NN for highly accurate identification, thus eliminating the need to retrain the network for enrolling new individuals. We evaluate two backbone architectures, ResNet, as previously used to identify Holstein Friesians using RGB...
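The point that k-NN over embeddings removes the need to retrain when enrolling new individuals can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the 2D embeddings, the labels, and the function name `knn_identify` are all hypothetical, and a real system would take its embeddings from the trained backbone.

```python
import math
from collections import Counter


def knn_identify(gallery, query, k=3):
    """Return the majority label among the k gallery embeddings nearest to `query`."""
    nearest = sorted(gallery, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical gallery of (embedding, identity) pairs from a frozen network.
gallery = [
    ((0.10, 0.90), "cow_A"), ((0.20, 0.80), "cow_A"), ((0.15, 0.85), "cow_A"),
    ((0.90, 0.10), "cow_B"), ((0.80, 0.20), "cow_B"), ((0.85, 0.15), "cow_B"),
]

known = knn_identify(gallery, (0.12, 0.88))

# Enrolling a new individual only appends embeddings; no weights are updated.
gallery.extend([
    ((0.50, 0.50), "cow_C"), ((0.48, 0.52), "cow_C"), ((0.53, 0.47), "cow_C"),
])
enrolled = knn_identify(gallery, (0.52, 0.49))
```

The design choice here is that identity lives in the gallery rather than in a classification head, so the cost of adding an animal is a few stored vectors.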
Residual-based Language Models are Free Boosters for Biomedical Imaging.
EN: In this study, we uncover the unexpected efficacy of residual-based large language models (LLMs) as part of encoders for biomedical imaging tasks, a domain traditionally devoid of language or textual data. The approach diverges from established methodologies by utilizing a frozen transformer block, extracted from pre-trained LLMs, as an innovative encoder layer for the direct processing of visual tokens. This strategy represents a significant departure from the standard multi-modal vision-language frameworks, which typically hinge on language-driven prompts and inputs. We found that these LLMs could boost performance across a spectrum of biomedical imaging applications, including both 2D and 3D visual classification tasks, serving as plug-and-play boosters. More interestingly, as a byproduct, we found that the proposed framework achieved superior performance, setting new state-of-the-art results on extensive, standardized datasets in MedMNIST-2D and 3D. Through this work, we aim to open new avenues for employing LLMs in biomedical imaging and enriching the understanding of their potential in this specialized domain.
Condensed-Phase Quantum Chemistry.
EN: Molecular quantum chemistry has seen enormous progress in the last few decades thanks to more advanced and sophisticated numerical techniques and computing power. Following the recent interest in extending these capabilities to condensed-phase problems, we summarize basic knowledge of condensed-phase quantum chemistry for those with experience in molecular quantum chemistry. We highlight recent efforts in this direction, including solving the electron repulsion integrals bottleneck and implementing hybrid density functional theory and wavefunction methods, and lattice dynamics for periodic systems within atom-centered basis sets. Many computational techniques presented here are inspired by the extensive method developments rooted in quantum chemistry. In this Focus Article, we selectively focus on the computational techniques rooted in molecular quantum chemistry, emphasize some challenges, and point out open questions. We hope our perspectives will encourage researchers to pursue this exciting and promising research avenue.
Strangers in a foreign land: 'Yeastizing' plant enzymes.
EN: Expressing plant metabolic pathways in microbial platforms is an efficient, cost-effective solution for producing many desired plant compounds. As eukaryotic organisms, yeasts are often the preferred platform. However, expression of plant enzymes in a yeast frequently leads to failure because the enzymes are poorly adapted to the foreign yeast cellular environment. Here we first summarize current engineering approaches for optimizing performance of plant enzymes in yeast. A critical limitation of these approaches is that they are labor-intensive and must be customized for each individual enzyme, which significantly hinders the establishment of plant pathways in cellular factories. In response to this challenge, we propose the development of a cost-effective computational pipeline to redesign plant enzymes for better adaptation to the yeast cellular milieu. This proposition is underpinned by compelling evidence that plant and yeast enzymes exhibit distinct sequence features that are generalizable across enzyme families. Consequently, we introduce a data-driven machine learning framework designed to extract 'yeastizing' rules from natural protein sequence variations, which can be bro...
Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization.
EN: The state recognition of the environment and objects by robots is generally based on the judgement of the current state as a classification problem. On the other hand, state changes of food in cooking happen continuously and need to be captured not only at a certain time point but also continuously over time. In addition, the state changes of food are complex and cannot be easily described by manual programming. Therefore, we propose a method to recognize the continuous state changes of food for cooking robots through the spoken language using pre-trained large-scale vision-language models. By using models that can compute the similarity between images and texts continuously over time, we can capture the state changes of food while cooking. We also show that by adjusting the weighting of each text prompt based on fitting the similarity changes to a sigmoid function and then performing black-box optimization, more accurate and robust continuous state recognition can be achieved. We demonstrate the effectiveness and limitations of this method by performing the recognition of water boiling, butter melting, egg cooking, and onion stir-frying.
Advancing Chinese biomedical text mining with community challenges.
EN: Objective: This study aims to review the recent advances in community challenges for biomedical text mining in China. Methods: We collected information on evaluation tasks released in community challenges of biomedical text mining, including task description, dataset description, data source, task type and related links. A systematic summary and comparative analysis were conducted on various biomedical natural language processing tasks, such as named entity recognition, entity normalization, attribute extraction, relation extraction, event extraction, text classification, text similarity, knowledge graph construction, question answering, text generation, and large language model evaluation. Results: We identified 39 evaluation tasks from 6 community challenges that spanned from 2017 to 2023. Our analysis revealed the diverse range of evaluation task types and data sources in biomedical text mining. We explored the potential clinical applications of these community challenge tasks from a translational biomedical informatics perspective. We compared them with their English counterparts, and discussed the contributions, limitations, lessons and guidelines of these community challenges, whi...
DecompOpt: Controllable and Decomposed Diffusion Models for Structure-based Molecular Optimization.
EN: Recently, 3D generative models have shown promising performance in structure-based drug design by learning to generate ligands given target binding sites. However, only modeling the target-ligand distribution can hardly fulfill one of the main goals in drug discovery -- designing novel ligands with desired properties, e.g., high binding affinity, easy synthesizability, etc. This challenge becomes particularly pronounced when the target-ligand pairs used for training do not align with these desired properties. Moreover, most existing methods aim at solving the de novo design task, while many generative scenarios requiring flexible controllability, such as R-group optimization and scaffold hopping, have received little attention. In this work, we propose DecompOpt, a structure-based molecular optimization method based on a controllable and decomposed diffusion model. DecompOpt presents a new generation paradigm which combines optimization with conditional diffusion models to achieve desired properties while adhering to the molecular grammar. Additionally, DecompOpt offers a unified framework covering both de novo design and controllable generation. To achieve this, ligand...
On the Efficient Marginalization of Probabilistic Sequence Models.
EN: Real-world data often exhibits sequential dependence, across diverse domains such as human behavior, medicine, finance, and climate modeling. Probabilistic methods capture the inherent uncertainty associated with prediction in these contexts, with autoregressive models being especially prominent. This dissertation focuses on using autoregressive models to answer complex probabilistic queries that go beyond single-step prediction, such as the timing of future events or the likelihood of a specific event occurring before another. In particular, we develop a broad class of novel and efficient approximation techniques for marginalization in sequential models that are model-agnostic. These techniques rely solely on access to and sampling from next-step conditional distributions of a pre-trained autoregressive model, including both traditional parametric models as well as more recent neural autoregressive models. Specific approaches are presented for discrete sequential models, for marked temporal point processes, and for stochastic jump processes, each tailored to a well-defined class of informative, long-range probabilistic queries.
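As a rough illustration of the kind of query described above, the sketch below estimates the probability that one event occurs before another using only samples from a next-step conditional distribution. It is a naive Monte Carlo baseline, not the dissertation's approximation techniques; the toy model `next_step_probs`, its probabilities, and all function names are assumptions made for this example.

```python
import random


def next_step_probs(history):
    """Hypothetical next-step conditional p(next | history).

    A real autoregressive model would condition on `history`; this toy
    model ignores it so the exact answer is known (0.3 / (0.3 + 0.2) = 0.6).
    """
    return {"a": 0.3, "b": 0.2, "c": 0.5}


def sample_next(history, rng):
    """Draw one symbol from the next-step conditional distribution."""
    probs = next_step_probs(history)
    symbols = list(probs)
    return rng.choices(symbols, weights=[probs[s] for s in symbols])[0]


def prob_a_before_b(n_samples=20000, horizon=200, seed=0):
    """Estimate P("a" occurs before "b") by sampling whole trajectories."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        history = []
        for _ in range(horizon):
            token = sample_next(history, rng)
            history.append(token)
            if token == "a":
                hits += 1
                break
            if token == "b":
                break
    return hits / n_samples
```

The query is answered without inspecting the model's internals, which is what makes such an estimator model-agnostic; its cost grows with the number of sampled trajectories.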
DeepCRE: Transforming Drug R&D via AI-Driven Cross-drug Response Evaluation.
EN: The fields of therapeutic application and drug research and development (R&D) both face substantial challenges, i.e., the therapeutic domain calls for more treatment alternatives, while numerous promising pre-clinical drugs have failed in clinical trials. One of the reasons is the inadequacy of Cross-drug Response Evaluation (CRE) during the late stages of drug R&D. Although in-silico CRE models bring a promising solution, existing methodologies are restricted to early stages of drug R&D, such as target and cell-line levels, offering limited improvement to clinical success rates. Herein, we introduce DeepCRE, a pioneering AI model designed to predict CRE effectively in the late stages of drug R&D. DeepCRE outperforms the existing best models by achieving an average performance improvement of 17.7% in patient-level CRE, and a 5-fold increase in indication-level CRE, facilitating more accurate personalized treatment predictions and better pharmaceutical value assessment for indications, respectively. Furthermore, DeepCRE has identified a set of six drug candidates that show significantly greater effectiveness than a comparator set of two approved drugs in 5/8 colorectal cancer organo...
Bridging Diversity and Uncertainty in Active learning with Self-Supervised Pre-Training.
EN: This study addresses the integration of diversity-based and uncertainty-based sampling strategies in active learning, particularly within the context of self-supervised pre-trained models. We introduce a straightforward heuristic called TCM that mitigates the cold start problem while maintaining strong performance across various data levels. By initially applying TypiClust for diversity sampling and subsequently transitioning to uncertainty sampling with Margin, our approach effectively combines the strengths of both strategies. Our experiments demonstrate that TCM consistently outperforms existing methods across various datasets in both low and high data regimes.
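The two-phase idea above can be sketched as follows, under stated simplifications: typicality is taken as the inverse mean distance to the k nearest neighbours, the clustering step of TypiClust is omitted, and the switch threshold `switch_at` is a hypothetical parameter. The abstract does not say how the transition point is chosen.

```python
import math


def typicality(points, k=2):
    """Inverse of the mean distance to the k nearest neighbours (higher = more typical)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(1.0 / (sum(nearest) / len(nearest) + 1e-12))
    return scores


def margin(probs):
    """Gap between the two largest class probabilities (smaller = more uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def tcm_select(points, probs, n_labeled, budget, switch_at):
    """Pick `budget` pool indices: typical points early, low-margin points later."""
    if n_labeled < switch_at:
        scores = typicality(points)
        order = sorted(range(len(points)), key=lambda i: -scores[i])
    else:
        order = sorted(range(len(points)), key=lambda i: margin(probs[i]))
    return order[:budget]
```

With few labels the selection ignores the (still unreliable) classifier and favours dense regions of the embedding space; once `n_labeled` reaches the threshold, it queries the points the classifier is least sure about.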
Generative Active Learning with Variational Autoencoder for Radiology Data Generation in Veterinary Medicine.
EN: Recently, with increasing interest in pet healthcare, the demand for computer-aided diagnosis (CAD) systems in veterinary medicine has increased. The development of veterinary CAD has stagnated due to a lack of sufficient radiology data. To overcome this challenge, we propose a generative active learning framework based on a variational autoencoder. This approach aims to alleviate the scarcity of reliable data for CAD systems in veterinary medicine. This study utilizes datasets comprising cardiomegaly radiograph data. After removing annotations and standardizing images, we employed a framework for data augmentation, which consists of a data generation phase and a query phase for filtering the generated data. The experimental results revealed that as the data generated through this framework was added to the training data of the generative model, the Fréchet inception distance consistently decreased from 84.14 to 50.75 on the radiograph. Subsequently, when the generated data were incorporated into the training of the classification model, the false positive of the confusion matrix also improved from 0.16 to 0.66 on the radiograph. The proposed framework has the potential to address t...
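For readers unfamiliar with the metric reported above, the Fréchet inception distance compares Gaussian fits to feature embeddings of real and generated images; lower values mean the two distributions are closer. The sketch below shows only the scalar (one-dimensional) case of the Fréchet distance between two Gaussians. The function name is an assumption, and the real metric works on multivariate Inception features with full covariance matrices.

```python
import math


def frechet_distance_1d(mu1, var1, mu2, var2):
    """Fréchet distance between N(mu1, var1) and N(mu2, var2) in one dimension.

    Scalar special case of ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    """
    return (mu1 - mu2) ** 2 + var1 + var2 - 2.0 * math.sqrt(var1 * var2)
```

Identical distributions give a distance of zero, and the value grows with both the gap between the means and the mismatch between the variances.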
From Noise to Signal: Unveiling Treatment Effects from Digital Health Data through Pharmacology-Informed Neural-SDE.
EN: Digital health technologies (DHT), such as wearable devices, provide personalized, continuous, and real-time monitoring of patients. These technologies are contributing to the development of novel therapies and personalized medicine. Gaining insight from these technologies requires appropriate modeling techniques to capture clinically relevant changes in disease state. The data generated from these devices are stochastic in nature, may have missing elements, and exhibit considerable inter-individual variability, making them difficult to analyze using traditional longitudinal modeling techniques. We present a novel pharmacology-informed neural stochastic differential equation (SDE) model capable of addressing these challenges. Using synthetic data, we demonstrate that our approach is effective in identifying treatment effects and learning causal relationships from stochastic data, thereby enabling counterfactual simulation.
Rethinking Specificity in SBDD: Leveraging Delta Score and Energy-Guided Diffusion.
EN: In the field of Structure-based Drug Design (SBDD), deep learning-based generative models have achieved outstanding performance in terms of docking score. However, further study shows that the existing molecular generative methods and docking scores both have lacked consideration in terms of specificity, which means that generated molecules bind to almost every protein pocket with high affinity. To address this, we introduce the Delta Score, a new metric for evaluating the specificity of molecular binding. To further incorporate this insight for generation, we develop an innovative energy-guided approach using contrastive learning, with active compounds as decoys, to direct generative models toward creating molecules with high specificity. Our empirical results show that this method not only enhances the delta score but also maintains or improves traditional docking scores, successfully bridging the gap between SBDD and real-world needs.
Spectral Operator Representations.
EN: Machine learning in atomistic materials science has grown to become a powerful tool, with most approaches focusing on atomic arrangements, typically decomposed into local atomic environments. This approach, while well-suited for machine-learned interatomic potentials, is conceptually at odds with learning complex intrinsic properties of materials, often driven by spectral properties commonly represented in reciprocal space (e.g., band gaps or mobilities) which cannot be readily atomically partitioned. For such applications, methods which represent the electronic rather than the atomic structure could be more promising. In this work, we present a general framework focused on electronic-structure descriptors which take advantage of the natural symmetries and inherent interpretability of physical models. Using this framework, we formulate two such representations and apply them respectively to measuring the similarity of carbon nanotubes and barium titanate polymorphs, and to the discovery of novel transparent conducting materials (TCMs) in the Materials Cloud 3D database (MC3D). A random forest classifier trained on 1% of the materials in the MC3D is able to correctly label 76% of en...
When SMILES have Language: Drug Classification using Text Classification Methods on Drug SMILES Strings.
EN: Complex chemical structures, like drugs, are usually defined by SMILES strings as a sequence of atoms and bonds. These SMILES strings are used in different complex machine learning-based drug-related research and representation works. Escaping from complex representation, in this work, we pose a single question: What if we treat drug SMILES as conventional sentences and engage in text classification for drug classification? Our experiments affirm the possibility with very competitive scores. The study explores the notion of viewing each atom and bond as sentence components, employing basic NLP methods to categorize drug types, proving that complex problems can also be solved with simpler perspectives. The data and code are available here: https://github.com/azminewasi/Drug-Classification-NLP.
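One way to picture "SMILES as sentences" is to split a string into atom and bond tokens and count token n-grams, which gives bag-of-words style features for any standard text classifier. The sketch below is an illustration only: the regular expression is a simplified tokenizer that silently skips symbols it does not list, and the function names are assumptions, not the code from the linked repository.

```python
import re
from collections import Counter

# Simplified SMILES tokenizer: two-letter halogens, bracket atoms,
# single-letter atoms, ring-closure digits, and common bond/branch symbols.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|[A-Za-z]|\d|[=#()+\-/\\@.%]")


def tokenize(smiles):
    """Split a SMILES string into a list of 'word' tokens."""
    return SMILES_TOKEN.findall(smiles)


def ngram_counts(smiles, n=2):
    """Count single tokens plus token n-grams, as bag-of-words features."""
    tokens = tokenize(smiles)
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(tokens + grams)
```

For example, `tokenize("CC(=O)Cl")` yields `['C', 'C', '(', '=', 'O', ')', 'Cl']`. The resulting counts can be fed to a linear classifier in the same way as word counts from ordinary text.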
Reverse Degree-Based Topological Indices and QSPR Analysis of Cancer Drugs.
EN: A topological index of a graph $G$ is a numerical quantity that describes its topology. Reverse degree-based topological indices play an important role in finding topological descriptors. Azacitidine, Decitabine, and Guadecitabine are hypomethylating agents which are used for the treatment of patients with higher-risk myelodysplastic syndromes, acute myeloid leukemia, and chronic myelomonocytic leukemia which are not suitable for in-depth treatments such as induction chemotherapy. In this article, some reverse degree-based topological indices of the three said drugs are computed. Furthermore, QSPR analysis of the said topological indices is discussed and it is shown that these topological indices are highly correlated with the physical properties of the three cancer drugs. These findings may help chemists and people working in the pharmaceutical industry to predict the properties of cancer drugs without experimenting.
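To make the notion concrete: for a vertex v of degree d(v) in a graph with maximum degree Δ, the reverse degree is commonly defined as R(v) = Δ - d(v) + 1. The sketch below computes two Zagreb-style sums over the edges of a hydrogen-suppressed molecular graph. The abstract does not list which indices the article computes, index names vary across papers, and the function names here are assumptions.

```python
def reverse_degrees(edges):
    """Map each vertex to its reverse degree: max_degree - degree + 1."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    max_degree = max(degree.values())
    return {v: max_degree - d + 1 for v, d in degree.items()}


def reverse_zagreb_indices(edges):
    """Return (sum of R(u) + R(v), sum of R(u) * R(v)) over all edges uv."""
    r = reverse_degrees(edges)
    first = sum(r[u] + r[v] for u, v in edges)
    second = sum(r[u] * r[v] for u, v in edges)
    return first, second
```

For the three-vertex path a-b-c, the degrees are 1, 2, 1, so the reverse degrees are 2, 1, 2, and the two sums are 6 and 4. In a QSPR study such values would then be regressed against measured physical properties.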
Towards Prebiotic Chemistry on Titan: Impact experiments on organic haze particles.
EN: Impacts are critical to producing the aqueous environments necessary to stimulate prebiotic chemistry on Titan's surface. Furthermore, organic hazes resting on the surface are a likely feedstock of biomolecules. In this work, we conduct impact experiments on laboratory-produced organic haze particles and haze/sand mixtures and analyze these samples for life's building blocks. Samples of unshocked haze and sand particles are also analyzed to determine the change in biomolecule concentrations and distributions from shocking. Across all samples, we detect seven nucleobases, nine proteinogenic amino acids, and five other biomolecules (e.g., urea) using a blank subtraction procedure to eliminate signals due to contamination. We find that shock pressures of 13 GPa variably degrade nucleobases, amino acids, and a few other organics in haze particles and haze/sand mixtures; however, certain individual biomolecules become enriched or are even produced from these events. Xanthine, threonine, and aspartic acid are enriched or produced in impact experiments containing sand, suggesting these minerals may catalyze the production of these biomolecules. On the other hand, thymine and isoleucine/no...
Deep Confident Steps to New Pockets: Strategies for Docking Generalization.
EN: Accurate blind docking has the potential to lead to new biological breakthroughs, but for this promise to be realized, docking methods must generalize well across the proteome. Existing benchmarks, however, fail to rigorously assess generalizability. Therefore, we develop DockGen, a new benchmark based on the ligand-binding domains of proteins, and we show that existing machine learning-based docking models have very weak generalization abilities. We carefully analyze the scaling laws of ML-based docking and show that, by scaling data and model size, as well as integrating synthetic data strategies, we are able to significantly increase the generalization capacity and set new state-of-the-art performance across benchmarks. Further, we propose Confidence Bootstrapping, a new training paradigm that solely relies on the interaction between diffusion and confidence models and exploits the multi-resolution generation process of diffusion models. We demonstrate that Confidence Bootstrapping significantly improves the ability of ML-based docking methods to dock to unseen protein classes, edging closer to accurate and generalizable blind docking methods.
Deep Sensitivity Analysis for Objective-Oriented Combinatorial Optimization.
EN: Pathogen control is a critical aspect of modern poultry farming, providing important benefits for both public health and productivity. Effective poultry management measures to reduce pathogen levels in poultry flocks promote food safety by lowering risks of food-borne illnesses. They also support animal health and welfare by preventing infectious diseases that can rapidly spread and impact flock growth, egg production, and overall health. This study frames the search for optimal management practices that minimize the presence of multiple pathogens as a combinatorial optimization problem. Specifically, we model the various possible combinations of management settings as a solution space that can be efficiently explored to identify configurations that optimally reduce pathogen levels. This design incorporates a neural network feedback-based method that combines feature explanations with global sensitivity analysis to ensure combinatorial optimization in multiobjective settings. Our preliminary experiments have promising results when applied to two real-world agricultural datasets. While further validation is still needed, these early experimental findings demonstrate the potential of...
DecompDiff: Diffusion Models with Decomposed Priors for Structure-Based Drug Design.
EN: Designing 3D ligands within a target binding site is a fundamental task in drug discovery. Existing structure-based drug design methods treat all ligand atoms equally, which ignores the different roles of atoms in the ligand for drug design and can be less efficient for exploring the large drug-like molecule space. In this paper, inspired by the convention in pharmaceutical practice, we decompose the ligand molecule into two parts, namely arms and scaffold, and propose a new diffusion model, DecompDiff, with decomposed priors over arms and scaffold. In order to facilitate the decomposed generation and improve the properties of the generated molecules, we incorporate both bond diffusion in the model and additional validity guidance in the sampling phase. Extensive experiments on CrossDocked2020 show that our approach achieves state-of-the-art performance in generating high-affinity molecules while maintaining proper molecular properties and conformational stability, with up to -8.39 Avg. Vina Dock score and 24.5 Success Rate. The code is provided at https://github.com/bytedance/DecompDiff
Phonetic and Lexical Discovery of a Canine Language using HuBERT.
EN: This paper explores potential communication patterns within dog vocalizations and moves beyond traditional linguistic analysis, which relies heavily on human a priori knowledge and limited datasets to find sound units in dog vocalization. We present a self-supervised approach with HuBERT, enabling the accurate classification of phoneme labels and the identification of vocal patterns that suggest a rudimentary vocabulary within dog vocalizations. Our findings indicate significant acoustic consistency in this identified canine vocabulary, which covers the entirety of observed dog vocalization sequences. We further develop a web-based dog vocalization labeling system. This system can highlight phoneme n-grams that are present in the vocabulary in dog audio uploaded by users.
Closing the AI generalization gap by adjusting for dermatology condition distribution differences across clinical settings.
EN: Recently, there has been great progress in the ability of artificial intelligence (AI) algorithms to classify dermatological conditions from clinical photographs. However, little is known about the robustness of these algorithms in real-world settings where several factors can lead to a loss of generalizability. Understanding and overcoming these limitations will permit the development of generalizable AI that can aid in the diagnosis of skin conditions across a variety of clinical settings. In this retrospective study, we demonstrate that differences in skin condition distribution, rather than in demographics or image capture mode are the main source of errors when an AI algorithm is evaluated on data from a previously unseen source. We demonstrate a series of steps to close this generalization gap, requiring progressively more information about the new source, ranging from the condition distribution to training data enriched for data less frequently seen during training. Our results also suggest comparable performance from end-to-end fine tuning versus fine tuning solely the classification layer on top of a frozen embedding model. Our approach can inform the adaptation of AI algo...
Structure-Based Drug Design via 3D Molecular Generative Pre-training and Sampling.
EN: Structure-based drug design aims at generating high-affinity ligands with prior knowledge of 3D target structures. Existing methods either use a conditional generative model to learn the distribution of 3D ligands given target binding sites, or iteratively modify molecules to optimize a structure-based activity estimator. The former is highly constrained by data quantity and quality, which leaves optimization-based approaches more promising in practical scenarios. However, existing optimization-based approaches choose to edit molecules in 2D space, and use molecular docking to estimate the activity using docking-predicted 3D target-ligand complexes. The misalignment between the action space and the objective hinders the performance of these models, especially for those that employ deep learning for acceleration. In this work, we propose MolEdit3D to combine 3D molecular generation with optimization frameworks. We develop a novel 3D graph editing model to generate molecules using fragments, and pre-train this model on abundant 3D ligands for learning target-independent properties. Then we employ a target-guided self-learning strategy to improve target-related properties using self-sampled m...
DrBenchmark: A Large Language Understanding Evaluation Benchmark for French Biomedical Domain.
EN: The biomedical domain has sparked a significant interest in the field of Natural Language Processing (NLP), which has seen substantial advancements with pre-trained language models (PLMs). However, comparing these models has proven challenging due to variations in evaluation protocols across different models. A fair solution is to aggregate diverse downstream tasks into a benchmark, allowing for the assessment of intrinsic PLM qualities from various perspectives. Although still limited to a few languages, this initiative has been undertaken in the biomedical field, notably for English and Chinese. This limitation hampers the evaluation of the latest French biomedical models, as they are either assessed on a minimal number of tasks with non-standardized protocols or evaluated using general downstream tasks. To bridge this research gap and account for the unique sensitivities of French, we present the first-ever publicly available French biomedical language understanding benchmark called DrBenchmark. It encompasses 20 diversified tasks, including named-entity recognition, part-of-speech tagging, question-answering, semantic textual similarity, and classification. We evaluate 8 state-of-th...
Molecule Generation and Optimization for Efficient Fragrance Creation.
EN: This research introduces a Machine Learning-centric approach to replicate olfactory experiences, validated through experimental quantification of perfume perception. Key contributions encompass a hybrid model connecting perfume molecular structure to human olfactory perception. This model includes an AI-driven molecule generator (utilizing Graph and Generative Neural Networks), quantification and prediction of odor intensity, and refinery of optimal solvent and molecule combinations for desired fragrances. Additionally, a thermodynamic-based model establishes a link between olfactory perception and liquid-phase concentrations. The methodology employs Transfer Learning and selects the most suitable molecules based on vapor pressure and fragrance notes. Ultimately, a mathematical optimization problem is formulated to minimize discrepancies between new and target olfactory experiences. The methodology is validated by reproducing two distinct olfactory experiences using available experimental data.
Re-Dock: Towards Flexible and Realistic Molecular Docking with Diffusion Bridge.
EN: Accurate prediction of protein-ligand binding structures, a task known as molecular docking, is crucial for drug design but remains challenging. While deep learning has shown promise, existing methods often depend on holo-protein structures (docked, and not accessible in realistic tasks) or neglect pocket sidechain conformations, leading to limited practical utility and unrealistic conformation predictions. To fill these gaps, we introduce an under-explored task, named flexible docking, to predict poses of ligand and pocket sidechains simultaneously, and introduce Re-Dock, a novel diffusion bridge generative model extended to geometric manifolds. Specifically, we propose energy-to-geometry mapping inspired by the Newton-Euler equation to co-model the binding energy and conformations for reflecting the energy-constrained docking generative process. Comprehensive experiments on designed benchmark datasets including apo-dock and cross-dock demonstrate our model's superior effectiveness and efficiency over current methods.
SusFL: Energy-Aware Federated Learning-based Monitoring for Sustainable Smart Farms.
EN: We propose a novel energy-aware federated learning (FL)-based system, namely SusFL, for sustainable smart farming to address the challenge of inconsistent health monitoring due to fluctuating energy levels of solar sensors. This system equips animals, such as cattle, with solar sensors with computational capabilities, including Raspberry Pis, to train a local deep-learning model on health data. These sensors periodically update Long Range (LoRa) gateways, forming a wireless sensor network (WSN) to detect diseases like mastitis. Our proposed SusFL system incorporates mechanism design, a game theory concept, for intelligent client selection to optimize monitoring quality while minimizing energy use. This strategy ensures the system's sustainability and resilience against adversarial attacks, including data poisoning and privacy threats, that could disrupt FL operations. Through extensive comparative analysis using real-time datasets, we demonstrate that our FL-based monitoring system significantly outperforms existing methods in prediction accuracy, operational efficiency, system reliability (i.e., mean time between failures or MTBF), and social welfare maximization by the mechanism ...
Convolutional Neural Networks Towards Facial Skin Lesions Detection.
EN: Facial analysis has emerged as a prominent area of research with diverse applications, including cosmetic surgery programs, the beauty industry, photography, and entertainment. Manipulating patient images often necessitates professional image processing software. This study contributes by providing a model that facilitates the detection of blemishes and skin lesions on facial images through a convolutional neural network and machine learning approach. The proposed method offers advantages such as simple architecture, speed and suitability for image processing while avoiding the complexities associated with traditional methods. The model comprises four main steps: area selection, scanning the chosen region, lesion diagnosis, and marking the identified lesion. Raw data for this research were collected from a reputable clinic in Tehran specializing in skincare and beauty services. The dataset includes administrative information, clinical data, and facial and profile images. A total of 2300 patient images were extracted from this raw data. A software tool was developed to crop and label lesions, with input from two treatment experts. In the lesion preparation phase, the selected area w...
Reducing model complexity by means of the Optimal Scaling: Population Balance Model for latex particles morphology formation.
EN: Rational computer-aided design of multiphase polymer materials is vital for rapid progress in many important applications, such as: diagnostic tests, drug delivery, coatings, additives for constructing materials, cosmetics, etc. Several property predictive models, including the prospective Population Balance Model for Latex Particles Morphology Formation (LPMF PBM), have already been developed for such materials. However, they lack computational efficiency, and the accurate prediction of materials' properties still remains a great challenge. To enhance performance of the LPMF PBM, we explore the feasibility of reducing its complexity through disregard of the aggregation terms of the model. The introduced nondimensionalization approach, which we call Optimal Scaling with Constraints, suggests a quantitative criterion for locating regions of slow and fast aggregation and helps to derive a family of dimensionless LPMF PBM of reduced complexity. The mathematical analysis of this new family is also provided. When compared with the original LPMF PBM, the resulting models demonstrate several orders of magnitude better computational efficiency.
Exact capacity of the wide hidden layer treelike neural networks with generic activations.
EN: Recent progress in studying treelike committee machines (TCM) neural networks (NN) in [Stojnictcmspnncaprdt23, Stojnictcmspnncapliftedrdt23, Stojnictcmspnncapdiffactrdt23] showed that the Random Duality Theory (RDT) and its partially lifted (pl RDT) variant are powerful tools that can be used for very precise network capacity analysis. Here, we consider wide hidden layer networks and uncover that certain aspects of numerical difficulties faced in [Stojnictcmspnncapdiffactrdt23] miraculously disappear. In particular, we employ recently developed fully lifted (fl) RDT to characterize the wide ($d\rightarrow \infty$) TCM nets capacity. We obtain explicit, closed form, capacity characterizations for a very generic class of the hidden layer activations. While the utilized approach significantly lowers the amount of the needed numerical evaluations, the ultimate fl RDT usefulness and success still require a solid portion of the residual numerical work. To get the concrete capacity values, we take four very famous activations examples: ReLU, quadratic, erf, and tanh. After successf...
Controlling flow patterns and topology in active emulsions.
EN: Active emulsions and liquid crystalline shells are intriguing and experimentally realisable types of topological matter. Here we numerically study the morphology and spatiotemporal dynamics of a double emulsion, where one or two passive small droplets are embedded in a larger active droplet. We find activity introduces a variety of rich and nontrivial nonequilibrium states in the system. First, a double emulsion with a single active droplet becomes self-motile, and there is a transition between translational and rotational motion: both of these regimes remain defect-free, hence topologically trivial. Second, a pair of particles nucleate one or more disclination loops, with conformational dynamics resembling a rotor or chaotic oscillator, accessed by tuning activity. In the first state a single, topologically charged, disclination loop powers the rotation. In the latter state, this disclination stretches and writhes in 3D, continuously undergoing recombination to yield an example of an active living polymer. These emulsions can be self-assembled in the lab, and provide a pathway to form flow and topology patterns in active matter in a controllable way, as opposed to bulk systems tha...
From Words to Molecules: A Survey of Large Language Models in Chemistry.
EN: In recent years, Large Language Models (LLMs) have achieved significant success in natural language processing (NLP) and various interdisciplinary areas. However, applying LLMs to chemistry is a complex task that requires specialized domain knowledge. This paper provides a thorough exploration of the nuanced methodologies employed in integrating LLMs into the field of chemistry, delving into the complexities and innovations at this interdisciplinary juncture. Specifically, our analysis begins with examining how molecular information is fed into LLMs through various representation and tokenization methods. We then categorize chemical LLMs into three distinct groups based on the domain and modality of their input data, and discuss approaches for integrating these inputs for LLMs. Furthermore, this paper delves into the pretraining objectives with adaptations to chemical LLMs. After that, we explore the diverse applications of LLMs in chemistry, including novel paradigms for their application in chemistry tasks. Finally, we identify promising research directions, including further integration with chemical knowledge, advancements in continual learning, and improvements in model interp...
DoubleMLDeep: Estimation of Causal Effects with Multimodal Data.
EN: This paper explores the use of unstructured, multimodal data, namely text and images, in causal inference and treatment effect estimation. We propose a neural network architecture that is adapted to the double machine learning (DML) framework, specifically the partially linear model. An additional contribution of our paper is a new method to generate a semi-synthetic dataset which can be used to evaluate the performance of causal effect estimation in the presence of text and images as confounders. The proposed methods and architectures are evaluated on the semi-synthetic dataset and compared to standard approaches, highlighting the potential benefit of using text and images directly in causal studies. Our findings have implications for researchers and practitioners in economics, marketing, finance, medicine and data science in general who are interested in estimating causal quantities using non-traditional data.
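A minimal sketch of the partially linear model that the DML framework targets, Y = theta*D + g(X) + eps with D = m(X) + v, may help make the idea concrete. It is not the paper's architecture: the confounder here is a single number in [0, 1) rather than text or images, and the nuisance functions are fitted with binned means instead of neural networks. The names `binned_mean_fit`, `dml_plr` and `simulate` are hypothetical.

```python
import random


def binned_mean_fit(xs, ys, n_bins=20):
    """Fit a piecewise-constant regressor on [0, 1): one mean per bin.

    Stand-in nuisance learner (assumption); the paper uses neural networks.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    # Empty bins fall back to the overall mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def dml_plr(xs, ds, ys, n_folds=2):
    """Cross-fitted partialling-out estimate of theta in Y = theta*D + g(X) + eps."""
    n = len(xs)
    num = den = 0.0
    for k in range(n_folds):
        test = [i for i in range(n) if i % n_folds == k]
        train = [i for i in range(n) if i % n_folds != k]
        # Nuisance functions are fitted on the training folds only.
        m_hat = binned_mean_fit([xs[i] for i in train], [ds[i] for i in train])  # E[D | X]
        l_hat = binned_mean_fit([xs[i] for i in train], [ys[i] for i in train])  # E[Y | X]
        for i in test:
            rd = ds[i] - m_hat(xs[i])  # treatment residual
            ry = ys[i] - l_hat(xs[i])  # outcome residual
            num += rd * ry
            den += rd * rd
    # Regress outcome residuals on treatment residuals.
    return num / den


def simulate(n=4000, theta=2.0, seed=0):
    """Toy data in which X confounds both the treatment D and the outcome Y."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ds = [2.0 * x + rng.gauss(0, 1) for x in xs]
    ys = [theta * d + 3.0 * x + rng.gauss(0, 1) for x, d in zip(xs, ds)]
    return xs, ds, ys


if __name__ == "__main__":
    xs, ds, ys = simulate()
    print("estimated theta:", dml_plr(xs, ds, ys))
```

Cross-fitting keeps the nuisance fits independent of the residuals they are applied to, which is the part of the framework that carries over when the nuisance learners are replaced by networks that read text and images.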
Hierarchical Multi-Label Classification of Online Vaccine Concerns.
EN: Vaccine concerns are an ever-evolving target, and can shift quickly as seen during the COVID-19 pandemic. Identifying longitudinal trends in vaccine concerns and misinformation might inform the healthcare space by helping public health efforts strategically allocate resources or information campaigns. We explore the task of detecting vaccine concerns in online discourse using large language models (LLMs) in a zero-shot setting without the need for expensive training datasets. Since real-time monitoring of online sources requires large-scale inference, we explore cost-accuracy trade-offs of different prompting strategies and offer concrete takeaways that may inform choices in system designs for current applications. An analysis of different prompting strategies reveals that classifying the concerns over multiple passes through the LLM, each consisting of a boolean question on whether the text mentions a given vaccine concern, works best. Our results indicate that GPT-4 can strongly outperform crowdworker accuracy when compared to ground truth annotations provided by experts on the recently introduced VaxConcerns dataset, achieving an overall F1 score of 78.7%.
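The multi-pass strategy described in this abstract can be sketched as one yes/no question per concern label. The sketch below is illustrative only: the label names, the `KEYWORDS` table and `keyword_stub` are hypothetical stand-ins so that the example runs without a model, and in practice `ask` would wrap a real LLM call.

```python
# Hypothetical labels and keywords; NOT the VaxConcerns taxonomy.
KEYWORDS = {
    "side effects": ("side effect", "reaction"),
    "mandates": ("mandate", "forced"),
}


def keyword_stub(text, label):
    """Stand-in for one boolean LLM query: does `text` mention `label`?"""
    lowered = text.lower()
    return any(k in lowered for k in KEYWORDS.get(label, ()))


def classify_multi_pass(text, labels, ask):
    """Run one boolean pass per label and collect the labels answered 'yes'."""
    return [label for label in labels if ask(text, label)]


if __name__ == "__main__":
    post = "I worry about side effects and allergic reactions."
    print(classify_multi_pass(post, list(KEYWORDS), keyword_stub))
```

The cost of this design is one model call per label and per text, which is the kind of cost-accuracy trade-off the abstract refers to.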
Optimal vaccination strategies on networks and in metropolitan areas.
EN: This study presents a mathematical model for optimal vaccination strategies in interconnected metropolitan areas, considering commuting patterns. It is a compartmental model with a vaccination rate for each city, acting as a control function. The commuting patterns are incorporated through a weighted adjacency matrix and a parameter that selects day and night periods. The optimal control problem is formulated to minimize a functional cost that balances the number of hospitalizations and vaccines, including restrictions of a weekly availability cap and an application capacity of vaccines per unit of time. The key findings of this work are bounds for the basic reproduction number, particularly in the case of a metropolitan area, and the study of the optimal control problem. Theoretical analysis and numerical simulations provide insights into disease dynamics and the effectiveness of control measures. The research highlights the importance of prioritizing vaccination in the capital to better control the disease spread, as we depicted in our numerical simulations. This model serves as a tool to improve resource allocation in epidemic control across metropolitan regions.
Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets.
EN: The remarkable progress of deep learning in dermatological tasks has brought us closer to achieving diagnostic accuracies comparable to those of human experts. However, while large datasets play a crucial role in the development of reliable deep neural network models, the quality of data therein and their correct usage are of paramount importance. Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition. In this paper, we conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k. We uncover these data quality issues, measure the effects of these problems on the benchmark results, and propose corrections to the datasets. Besides ensuring the reproducibility of our analysis by making our analysis pipeline and the accompanying code publicly available, we aim to encourage similar explorations and to facilitate the identification and addressing of potential data quality issues in other large datasets.
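Two of the issues named above, duplicates and train-test leakage, can be illustrated with a small check. This is only a sketch under a strong assumption: it catches byte-identical files by hashing, whereas near-duplicate images (resized or re-encoded copies) would need a similarity-based method. The function names and toy data are hypothetical.

```python
import hashlib
from collections import defaultdict


def content_index(items):
    """Map SHA-256 digest -> list of item ids sharing exactly the same bytes."""
    index = defaultdict(list)
    for item_id, data in items.items():
        index[hashlib.sha256(data).hexdigest()].append(item_id)
    return index


def split_report(train, test):
    """Return (duplicate groups inside train, (train ids, test ids) leak pairs)."""
    tr, te = content_index(train), content_index(test)
    duplicates_in_train = [sorted(ids) for ids in tr.values() if len(ids) > 1]
    leaks = [(sorted(tr[h]), sorted(te[h])) for h in tr if h in te]
    return duplicates_in_train, leaks


if __name__ == "__main__":
    # Toy byte strings stand in for image files.
    train = {"a": b"x", "b": b"x", "c": b"y"}
    test = {"d": b"y", "e": b"z"}
    print(split_report(train, test))
```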
DeepRLI: A Multi-objective Framework for Universal Protein--Ligand Interaction Prediction.
EN: Protein (receptor)--ligand interaction prediction is a critical component in computer-aided drug design, significantly influencing molecular docking and virtual screening processes. Despite the development of numerous scoring functions in recent years, particularly those employing machine learning, accurately and efficiently predicting binding affinities for protein--ligand complexes remains a formidable challenge. Most contemporary methods are tailored for specific tasks, such as binding affinity prediction, binding pose prediction, or virtual screening, often failing to encompass all aspects. In this study, we put forward DeepRLI, a novel protein--ligand interaction prediction architecture. It encodes each protein--ligand complex into a fully connected graph, retaining the integrity of the topological and spatial structure, and leverages the improved graph transformer layers with cosine envelope as the central module of the neural network, thus exhibiting superior scoring power. In order to equip the model to generalize to conformations beyond the confines of crystal structures and to adapt to molecular docking and virtual screening tasks, we propose a multi-objective strategy, t...
Influence of particle size polydispersity on dynamical heterogeneities in dense particle packings.
EN: The dynamics of dense particle packings near the jamming transition is characterized by correlated particle motion. The growth of dynamical heterogeneities, or strong spatial variations in the motion of the particles constituting the system, is a hallmark feature of slow glassy dynamics. We report here a systematic confocal microscopy study that characterizes the cooperative dynamics of fluorescently-labelled colloidal particles in dense aqueous suspensions. We demonstrate that jammed particulate suspensions can be fluidized by increasing the width of the particle size distribution. Our molecular dynamics simulations, performed to numerically investigate the effects of continuous-size polydispersity on dense particle packing dynamics, show an excellent match with our experimental results. Besides shedding light on the fundamental aspects of particle-scale dynamics at the jamming-unjamming transition, our findings are significant in the processing of commonly-encountered dense suspensions such as paints, cosmetics, and food.
Rigid Protein-Protein Docking via Equivariant Elliptic-Paraboloid Interface Prediction.
EN: The study of rigid protein-protein docking plays an essential role in a variety of tasks such as drug design and protein engineering. Recently, several learning-based methods have been proposed for the task, exhibiting much faster docking speed than conventional computational methods. In this paper, we propose a novel learning-based method called ElliDock, which predicts an elliptic paraboloid to represent the protein-protein docking interface. To be specific, our model estimates elliptic paraboloid interfaces for the two input proteins respectively, and obtains the roto-translation transformation for docking by making the two interfaces coincide. By its design, ElliDock is independently equivariant with respect to arbitrary rotations/translations of the proteins, which is an indispensable property to ensure the generalization of the docking process. Experimental evaluations show that ElliDock achieves the fastest inference time among all compared methods and is strongly competitive with current state-of-the-art learning-based models such as DiffDock-PP and Multimer, particularly for antibody-antigen docking.
Empirical Evidence for the Fragment level Understanding on Drug Molecular Structure of LLMs.
EN: AI for drug discovery has been a research hotspot in recent years, and SMILES-based language models have been increasingly applied in drug molecular design. However, no work has explored whether and how language models understand the chemical spatial structure from 1D sequences. In this work, we pre-train a transformer model on chemical language and fine-tune it toward drug design objectives, and investigate the correspondence between high-frequency SMILES substrings and molecular fragments. The results indicate that language models can understand chemical structures from the perspective of molecular fragments, and the structural knowledge learned through fine-tuning is reflected in the high-frequency SMILES substrings generated by the model.
Binding-Adaptive Diffusion Models for Structure-Based Drug Design.
EN: Structure-based drug design (SBDD) aims to generate 3D ligand molecules that bind to specific protein targets. Existing 3D deep generative models including diffusion models have shown great promise for SBDD. However, it is complex to capture the essential protein-ligand interactions exactly in 3D space for molecular generation. To address this problem, we propose a novel framework, namely Binding-Adaptive Diffusion Models (BindDM). In BindDM, we adaptively extract the subcomplex, the essential part of binding sites responsible for protein-ligand interactions. Then the selected protein-ligand subcomplex is processed with SE(3)-equivariant neural networks, and transmitted back to each atom of the complex for augmenting the target-aware 3D molecule diffusion generation with binding interaction information. We iterate this hierarchical complex-subcomplex process with a cross-hierarchy interaction node for adequately fusing global binding context between the complex and its corresponding subcomplex. Empirical studies on the CrossDocked2020 dataset show BindDM can generate molecules with more realistic 3D structures and higher binding affinities towards the protein targets, with up to -5.92 Av...
BioDiffusion: A Versatile Diffusion Model for Biomedical Signal Synthesis.
EN: Machine learning tasks involving biomedical signals frequently grapple with issues such as limited data availability, imbalanced datasets, labeling complexities, and the interference of measurement noise. These challenges often hinder the optimal training of machine learning algorithms. Addressing these concerns, we introduce BioDiffusion, a diffusion-based probabilistic model optimized for the synthesis of multivariate biomedical signals. BioDiffusion demonstrates excellence in producing high-fidelity, non-stationary, multivariate signals for a range of tasks including unconditional, label-conditional, and signal-conditional generation. Leveraging these synthesized signals offers a notable solution to the aforementioned challenges. Our research encompasses both qualitative and quantitative assessments of the synthesized data quality, underscoring its capacity to bolster accuracy in machine learning tasks tied to biomedical signals. Furthermore, when juxtaposed with current leading time-series generative models, empirical evidence suggests that BioDiffusion outperforms them in biomedical signal generation quality.
Crowding-Regulated Binding of Divalent Biomolecules.
EN: Macromolecular crowding affects biophysical processes as diverse as diffusion, gene expression, cell growth, and senescence. Yet, there is no comprehensive understanding of how crowding affects reactions, particularly multivalent binding. Herein, we use scaled particle theory and develop a molecular simulation method to investigate the binding of monovalent to divalent biomolecules. We find that crowding can increase or reduce cooperativity--the extent to which the binding of a second molecule is enhanced after binding a first molecule--by orders of magnitude, depending on the sizes of the involved molecular complexes. Cooperativity generally increases when a divalent molecule swells and then shrinks upon binding two ligands. Our calculations also reveal that, in some cases, crowding enables binding that does not occur otherwise. As an immunological example, we consider Immunoglobulin G-antigen binding and show that crowding enhances its cooperativity in bulk but reduces it when an Immunoglobulin G binds antigens on a surface.
Skin Cancer Segmentation and Classification Using Vision Transformer for Automatic Analysis in Dermatoscopy-based Non-invasive Digital System.
EN: Skin cancer is a global health concern, necessitating early and accurate diagnosis for improved patient outcomes. This study introduces a groundbreaking approach to skin cancer classification, employing the Vision Transformer, a state-of-the-art deep learning architecture renowned for its success in diverse image analysis tasks. Utilizing the HAM10000 dataset of 10,015 meticulously annotated skin lesion images, the model undergoes preprocessing for enhanced robustness. The Vision Transformer, adapted to the skin cancer classification task, leverages the self-attention mechanism to capture intricate spatial dependencies, achieving superior performance over traditional deep learning architectures. Segment Anything Model aids in precise segmentation of cancerous areas, attaining high IOU and Dice Coefficient. Extensive experiments highlight the model's supremacy, particularly the Google-based ViT patch-32 variant, which achieves 96.15% accuracy and showcases potential as an effective tool for dermatologists in skin cancer diagnosis, contributing to advancements in dermatological practices.
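The two segmentation metrics mentioned above have standard definitions: IoU = |A ∩ B| / |A ∪ B| and Dice = 2|A ∩ B| / (|A| + |B|), where A and B are the predicted and reference masks. A minimal sketch on small binary masks follows; the function name is hypothetical, and returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract.

```python
def iou_and_dice(pred, truth):
    """Compute (IoU, Dice) for two binary masks given as equal-shaped nested lists."""
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in truth for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(a and b for a, b in zip(p, t))
    p_sum, t_sum = sum(p), sum(t)
    union = p_sum + t_sum - inter
    if union == 0:
        # Both masks empty: treated as a perfect match (assumed convention).
        return 1.0, 1.0
    return inter / union, 2 * inter / (p_sum + t_sum)


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    truth = [[1, 0], [0, 0]]
    print(iou_and_dice(pred, truth))
```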
A Hybrid Quantum Computing Pipeline for Real World Drug Discovery.
EN: Quantum computing, with its superior computational capabilities compared to classical approaches, holds the potential to revolutionize numerous scientific domains, including pharmaceuticals. However, the application of quantum computing for drug discovery has primarily been limited to proof-of-concept studies, which often fail to capture the intricacies of real-world drug development challenges. In this study, we diverge from conventional investigations by developing a hybrid quantum computing pipeline tailored to address genuine drug design problems. Our approach underscores the application of quantum computation in drug discovery and propels it towards a more scalable system. We specifically construct our versatile quantum computing pipeline to address two critical tasks in drug discovery: the precise determination of Gibbs free energy profiles for prodrug activation involving covalent bond cleavage, and the accurate simulation of covalent bond interactions. This work serves as a pioneering effort in benchmarking quantum computing against veritable scenarios encountered in drug design, especially the covalent bonding issue present in both of the case studies, thereby transiti...
Can Large Language Models Understand Molecules?.
EN: Purpose: Large Language Models (LLMs) like GPT (Generative Pre-trained Transformer) from OpenAI and LLaMA (Large Language Model Meta AI) from Meta AI are increasingly recognized for their potential in the field of cheminformatics, particularly in understanding Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs also have the ability to decode SMILES strings into vector representations. Method: We investigate the performance of GPT and LLaMA compared to pre-trained models on SMILES in embedding SMILES strings on downstream tasks, focusing on two key applications: molecular property prediction and drug-drug interaction prediction. Results: We find that SMILES embeddings generated using LLaMA outperform those from GPT in both molecular property and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to pre-trained models on SMILES in molecular prediction tasks and outperform the pre-trained models for the DDI prediction tasks. Conclusion: The performance of LLMs in generating SMILES embeddings shows great potential for further investigation of these models for molecular embeddi...
Deep Learning model predicts the c-Kit-11 mutational status of canine cutaneous mast cell tumors by HE stained histological slides.
EN: Numerous prognostic factors are currently assessed histopathologically in biopsies of canine mast cell tumors to evaluate clinical behavior. In addition, PCR analysis of the c-Kit exon 11 mutational status is often performed to evaluate the potential success of a tyrosine kinase inhibitor therapy. This project aimed at training deep learning models (DLMs) to identify the c-Kit-11 mutational status of MCTs solely based on morphology without additional molecular analysis. HE slides of 195 mutated and 173 non-mutated tumors were stained consecutively in two different laboratories and scanned with three different slide scanners. This resulted in six different datasets (stain-scanner variations) of whole slide images. DLMs were trained with single and mixed datasets and their performance was assessed under scanner and staining domain shifts. The DLMs correctly classified HE slides according to their c-Kit 11 mutation status in, on average, 87% of cases for the best-suited stain-scanner variant. A relevant performance drop could be observed when the stain-scanner combination of the training and test dataset differed. Multi-variant datasets improved the average accuracy but did not reach...
Exploring the Effectiveness of Instruction Tuning in Biomedical Language Processing.
EN: Large Language Models (LLMs), particularly those similar to ChatGPT, have significantly influenced the field of Natural Language Processing (NLP). While these models excel in general language tasks, their performance in domain-specific downstream tasks such as biomedical and clinical Named Entity Recognition (NER), Relation Extraction (RE), and Medical Natural Language Inference (NLI) is still evolving. In this context, our study investigates the potential of instruction tuning for biomedical language processing, applying this technique to two general LLMs of substantial scale. We present a comprehensive, instruction-based model trained on a dataset that consists of approximately 200,000 instruction-focused samples. This dataset represents a carefully curated compilation of existing data, meticulously adapted and reformatted to align with the specific requirements of our instruction-based tasks. This initiative represents an important step in utilising such models to achieve results on par with specialised encoder-only models like BioBERT and BioClinicalBERT for various classical biomedical NLP tasks. Our work includes an analysis of the dataset's composition and its impact on mo...
Quantum state tracking and control of a single molecular ion in a thermal environment.
EN: Understanding molecular state evolution is central to many disciplines, including molecular dynamics, precision measurement, and molecule-based quantum technology. Details of the evolution are obscured when observing a statistical ensemble of molecules. Here, we reported real-time observations of thermal radiation-driven transitions between individual states ("jumps") of a single molecule. We reversed these "jumps" through microwave-driven transitions, resulting in a twentyfold improvement in the time the molecule dwells in a chosen state. The measured transition rates showed anisotropy in the thermal environment, pointing to the possibility of using single molecules as in-situ probes for the strengths of ambient fields. Our approaches for state detection and manipulation could apply to a wide range of species, facilitating their uses in fields including quantum science, molecular physics, and ion-neutral chemistry.
Molecular Property Prediction Based on Graph Structure Learning.
EN: Molecular property prediction (MPP) is a fundamental but challenging task in the computer-aided drug discovery process. A growing number of recent works employ graph-based models for MPP and have made considerable progress in improving prediction performance. However, current models often ignore relationships between molecules, which could also be helpful for MPP. To this end, we propose a graph structure learning (GSL) based MPP approach, called GSL-MPP. Specifically, we first apply a graph neural network (GNN) over molecular graphs to extract molecular representations. Then, with molecular fingerprints, we construct a molecular similarity graph (MSG). Following that, we conduct graph structure learning on the MSG (i.e., molecule-level graph structure learning) to get the final molecular embeddings, which are the results of fusing both GNN encoded molecular representations and the relationships among molecules, i.e., combining both intra-molecule and inter-molecule information. Finally, we use these molecular embeddings to perform MPP. Extensive experiments on seven benchmark datasets show that our method could achieve state-of-the-art performance in m...
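The step of building a molecular similarity graph from fingerprints can be sketched as follows. The abstract does not say which similarity measure or edge rule GSL-MPP uses, so Tanimoto similarity with a fixed threshold is an assumption here, and the fingerprints are toy sets of "on" bit indices rather than real ones. The function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_graph(fingerprints, threshold=0.5):
    """Return {(u, v): similarity} for molecule pairs at or above the threshold."""
    names = sorted(fingerprints)
    edges = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            s = tanimoto(fingerprints[u], fingerprints[v])
            if s >= threshold:
                edges[(u, v)] = s
    return edges


if __name__ == "__main__":
    # Toy fingerprints (assumption): each molecule is a set of on-bit positions.
    fps = {"mol_a": {1, 2, 3}, "mol_b": {2, 3, 4}, "mol_c": {9}}
    print(similarity_graph(fps))
```

In the approach described above, a graph of this kind is only the starting point: its structure is then refined by graph structure learning together with the GNN-derived molecular representations.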
Multi-level biomedical NER through multi-granularity embeddings and enhanced labeling.
EN: Biomedical Named Entity Recognition (NER) is a fundamental task of Biomedical Natural Language Processing for extracting relevant information from biomedical texts, such as clinical records, scientific publications, and electronic health records. The conventional approaches for biomedical NER mainly use traditional machine learning techniques, such as Conditional Random Fields and Support Vector Machines, or deep learning-based models like Recurrent Neural Networks and Convolutional Neural Networks. Recently, Transformer-based models, including BERT, have been used in the domain of biomedical NER and have demonstrated remarkable results. However, these models are often based on word-level embeddings, limiting their ability to capture character-level information, which is effective in biomedical NER due to the high variability and complexity of biomedical texts. To address these limitations, this paper proposes a hybrid approach that integrates the strengths of multiple models. The approach leverages fine-tuned BERT to provide contextualized word embeddings, a pre-trained multi-channel CNN for character-level information capture, followed by a BiLS...
Decoding Concerns: Multi-label Classification of Vaccine Sentiments in Social Media.
EN: In the realm of public health, vaccination stands as the cornerstone for mitigating disease risks and controlling their proliferation. The recent COVID-19 pandemic has highlighted how vaccines play a crucial role in keeping us safe. However, the situation involves a mix of perspectives, with skepticism towards vaccines prevailing for various reasons such as political dynamics, apprehensions about side effects, and more. The paper addresses the challenge of comprehensively understanding and categorizing these diverse concerns expressed in the context of vaccination. Our focus is on developing a robust multi-label classifier capable of assigning specific concern labels to tweets based on the articulated apprehensions towards vaccines. To achieve this, we delve into the application of a diverse set of advanced natural language processing techniques and machine learning algorithms, including transformer models like BERT, the state-of-the-art GPT 3.5, Classifier Chains, and traditional methods like SVM, Random Forest, and Naive Bayes. We see that the cutting-edge large language model outperforms all other methods in this context.
Capacity of the treelike sign perceptrons neural networks with one hidden layer -- RDT based upper bounds.
EN: We study the capacity of sign perceptrons neural networks (SPNN) and particularly focus on 1-hidden layer treelike committee machine (TCM) architectures. Similarly to what happens in the case of a single perceptron neuron, it turns out that, in a statistical sense, the capacity of a corresponding multilayered network architecture consisting of multiple sign perceptrons also undergoes the so-called phase transition (PT) phenomenon. This means: (i) for a certain range of system parameters (size of data, number of neurons), the network can be properly trained to accurately memorize all elements of the input dataset; and (ii) outside the region such a training does not exist. Clearly, determining the corresponding phase transition curve that separates these regions is an extraordinary task and among the most fundamental questions related to the performance of any network. Utilizing a powerful mathematical engine called Random Duality Theory (RDT), we establish a generic framework for determining the upper bounds on the 1-hidden layer TCM SPNN capacity. Moreover, we do so for any given (odd) number of neurons. We further show that the obtained results \emp...
Learning to Denoise Biomedical Knowledge Graph for Robust Molecular Interaction Prediction.
EN: Molecular interaction prediction plays a crucial role in forecasting unknown interactions between molecules, such as drug-target interaction (DTI) and drug-drug interaction (DDI), which are essential in the field of drug discovery and therapeutics. Although previous prediction methods have yielded promising results by leveraging the rich semantics and topological structure of biomedical knowledge graphs (KGs), they have primarily focused on enhancing predictive performance without addressing the presence of inevitable noise and inconsistent semantics. This limitation has hindered the advancement of KG-based prediction methods. To address this limitation, we propose BioKDN (Biomedical Knowledge Graph Denoising Network) for robust molecular interaction prediction. BioKDN refines the reliable structure of local subgraphs by denoising noisy links in a learnable manner, providing a general module for extracting task-relevant interactions. To enhance the reliability of the refined structure, BioKDN maintains consistent and robust semantics by smoothing relations around the target interaction. By maximizing the mutual information between reliable structure and smoothed relations, BioKDN e...
Freezing-induced topological transition of double-emulsion.
EN: Solidification of complex liquids is pertinent to numerous natural and industrial processes. Here, we examine the freezing of a W/O/W double-emulsion, i.e., water-in-oil compound droplets dispersed in water. We show that the solidification of such hierarchical emulsions can trigger a topological transition; for example, in our case, we observe the transition from the stable W/O/W state to a (frozen) O/W single-emulsion configuration. Strikingly, this transition is characterised by sudden expulsion of the inner water drop from the encapsulating oil droplet. We propose that this topological transition is triggered by the freezing of the encapsulating oil droplet from the outside in, which puts tension on the inner water drop and thus destabilizes the W/O/W configuration. Using high-speed imaging, we characterize the destabilization process. Interestingly, we find that below a critical size of the inner drop, $R_{\mathrm{in,crit}} \approx 19 \, \mu\mathrm{m}$, the topological transition no longer occurs and the double-emulsion remains stable, in line with our interpretation.
Protein Language Model-Powered 3D Ligand Binding Site Prediction from Protein Sequence.
EN: Prediction of ligand binding sites of proteins is a fundamental and important task for understanding the function of proteins and screening potential drugs. Most existing methods require experimentally determined protein holo-structures as input. However, such structures may be unavailable for novel or less-studied proteins. To tackle this limitation, we propose LaMPSite, which only takes protein sequences and ligand molecular graphs as input for ligand binding site predictions. The protein sequences are used to retrieve residue-level embeddings and contact maps from the pre-trained ESM-2 protein language model. The ligand molecular graphs are fed into a graph neural network to compute atom-level embeddings. Then we compute and update the protein-ligand interaction embedding based on the protein residue-level embeddings and ligand atom-level embeddings, and the geometric constraints in the inferred protein contact map and ligand distance map. A final pooling over the protein-ligand interaction embedding indicates which residues belong to the binding sites. Without any 3D coordinate information of proteins, our proposed model achieves competitive performance compared to baseline metho...
Livestock feeding behaviour: A review on automated systems for ruminant monitoring.
EN: Livestock feeding behaviour is an influential research area for those involved in animal husbandry and agriculture. In recent years, there has been a growing interest in automated systems for monitoring the behaviour of ruminants. Despite the developments accomplished in the last decade, there is still much to do and learn about the methods for measuring and analysing livestock feeding behaviour. Automated monitoring systems mainly use motion, acoustic, and image sensors to collect animal behavioural data. The performance evaluation of existing methods is a complex task and direct comparisons between studies are difficult. Several factors prevent a direct comparison, starting from the diversity of data and performance metrics used in the experiments. To the best of our knowledge, this work represents the first tutorial-style review on the analysis of the feeding behaviour of ruminants, emphasising the relationship between sensing methodologies, signal processing, and computational intelligence methods. It assesses the main sensing methodologies (i.e. based on movement, sound, images/videos, and pressure) and the main techniques to measure and analyse the signals associated with fee...
Enhancing Ligand Pose Sampling for Molecular Docking.
EN: Deep learning promises to dramatically improve scoring functions for molecular docking, leading to substantial advances in binding pose prediction and virtual screening. To train scoring functions, and to perform molecular docking, one must generate a set of candidate ligand binding poses. Unfortunately, the sampling protocols currently used to generate candidate poses frequently fail to produce any poses close to the correct, experimentally determined pose, unless information about the correct pose is provided. This limits the accuracy of learned scoring functions and molecular docking. Here, we describe two improved protocols for pose sampling: GLOW (auGmented sampLing with sOftened vdW potential) and a novel technique named IVES (IteratiVe Ensemble Sampling). Our benchmarking results demonstrate the effectiveness of our methods in improving the likelihood of sampling accurate poses, especially for binding pockets whose shape changes substantially when different ligands bind. This improvement is observed across both experimentally determined and AlphaFold-generated protein structures. Additionally, we present datasets of candidate ligand poses generated using our methods for each o...
AI in Pharma for Personalized Sequential Decision-Making: Methods, Applications and Opportunities.
EN: In the pharmaceutical industry, the use of artificial intelligence (AI) has seen consistent growth over the past decade. This rise is attributed to major advancements in statistical machine learning methodologies, computational capabilities and the increased availability of large datasets. AI techniques are applied throughout different stages of drug development, ranging from drug discovery to post-marketing benefit-risk assessment. Kolluri et al. provided a review of several case studies that span these stages, featuring key applications such as protein structure prediction, success probability estimation, subgroup identification, and AI-assisted clinical trial monitoring. From a regulatory standpoint, there was a notable uptick in submissions incorporating AI components in 2021. The most prevalent therapeutic areas leveraging AI were oncology (27%), psychiatry (15%), gastroenterology (12%), and neurology (11%). The paradigm of personalized or precision medicine has gained significant traction in recent research, partly due to advancements in AI techniques \cite{hamburg2010path}. This shift has had a transformative impact on the pharmaceutical industry. Departing from the traditio...
Multi-scale Iterative Refinement towards Robust and Versatile Molecular Docking.
EN: Molecular docking is a key computational tool utilized to predict the binding conformations of small molecules to protein targets, which is fundamental in the design of novel drugs. Despite recent advancements in geometric deep learning-based approaches leading to improvements in blind docking efficiency, these methods have encountered notable challenges, such as limited generalization performance on unseen proteins, the inability to concurrently address the settings of blind docking and site-specific docking, and the frequent occurrence of physical implausibilities such as inter-molecular steric clash. In this study, we introduce DeltaDock, a robust and versatile framework designed for efficient molecular docking to overcome these challenges. DeltaDock operates in a two-step process: rapid initial complex structures sampling followed by multi-scale iterative refinement of the initial structures. In the initial stage, to sample accurate structures with high efficiency, we develop a ligand-dependent binding site prediction model founded on large protein models and graph neural networks. This model is then paired with GPU-accelerated sampling algorithms. The sampled structures are up...
NMR Spectroscopy Can Help Accelerate Antiviral Drug Discovery Programs.
EN: Small molecule drugs have an important role to play in combating viral infections, and biophysics support has been central for contributing to the discovery and design of direct acting antivirals. Perhaps one of the most successful biophysical tools for this purpose is NMR spectroscopy when utilized strategically and pragmatically within team workflows and timelines. This report describes some clear examples of how NMR applications contributed to the design of antivirals when combined with medicinal chemistry, biochemistry, X-ray crystallography and computational chemistry. Overall, these multidisciplinary approaches allowed teams to reveal and expose compound physical properties from which design ideas were spawned and tested to achieve the desired successes. Examples are discussed for the discovery of antivirals that target HCV, HIV and SARS-CoV-2.
MultiModal-Learning for Predicting Molecular Properties: A Framework Based on Image and Graph Structures.
EN: The quest for accurate prediction of drug molecule properties poses a fundamental challenge in the realm of Artificial Intelligence Drug Discovery (AIDD). An effective representation of drug molecules emerges as a pivotal component in this pursuit. Contemporary leading-edge research predominantly resorts to self-supervised learning (SSL) techniques to extract meaningful structural representations from large-scale, unlabeled molecular data, subsequently fine-tuning these representations for an array of downstream tasks. However, an inherent shortcoming of these studies lies in their singular reliance on one modality of molecular information, such as molecule image or SMILES representations, thus neglecting the potential complementarity of various molecular modalities. In response to this limitation, we propose MolIG, a novel MultiModaL molecular pre-training framework for predicting molecular properties based on Image and Graph structures. MolIG model innovatively leverages the coherence and correlation between molecule graph and molecule image to execute self-supervised tasks, effectively amalgamating the strengths of both molecular representation forms. This holistic approach allo...
DiffBindFR: An SE(3) Equivariant Network for Flexible Protein-Ligand Docking.
EN: Molecular docking, a key technique in structure-based drug design, plays pivotal roles in protein-ligand interaction modeling, hit identification and optimization, in which accurate prediction of the protein-ligand binding mode is essential. Conventional docking approaches perform well in redocking tasks with a known protein binding pocket conformation in the complex state. However, in real-world docking scenarios where the protein binding conformation for a new ligand is unknown, accurately modeling the binding complex structure remains challenging, as flexible docking is computationally expensive and inaccurate. Typical deep learning-based docking methods do not explicitly consider protein side chain conformations and fail to ensure physical plausibility and detailed atomic interactions. In this study, we present DiffBindFR, a full-atom diffusion-based flexible docking model that operates over the product space of ligand overall movements and flexibility and pocket side chain torsion changes. We show that DiffBindFR has higher accuracy in producing native-like binding structures with physically plausible and detailed interactions than available docking methods. Furthermore, in the Apo...
Non-Hermitian molecular dynamics simulations of exciton-polaritons in lossy cavities.
EN: The observation that materials can change their properties when placed inside or near an optical resonator has sparked a fervid interest in understanding the effects of strong light-matter coupling on molecular dynamics, and several approaches have been proposed to extend the methods of computational chemistry into this regime. Whereas the majority of these approaches have focused on modelling a single molecule coupled to a single cavity mode, changes to chemistry have so far only been observed experimentally when very many molecules are coupled collectively to multiple modes with short lifetimes. While atomistic simulations of many molecules coupled to multiple cavity modes have been performed with semi-classical molecular dynamics, an explicit description of cavity losses has so far been restricted to simulations in which only a very few molecular degrees of freedom were considered. Here, we have implemented an effective non-Hermitian Hamiltonian to explicitly treat cavity losses in large-scale semi-classical molecular dynamics simulations of organic polaritons and used it to perform both mean-field and surface hopping simulations of polariton relaxation, propagation and energy ...
Droplet Size Distribution in Emulsions.
EN: The droplet size in emulsions is known to affect the rheological properties and plays a crucial role in the many applications of emulsions. Despite its importance, the underlying mechanisms governing droplet size in emulsification remain poorly understood. We investigate the average drop size and size distribution upon emulsification with a high-shear mixer for model oil-in-water emulsions stabilized by a surfactant. The size distribution is found to be a log-normal distribution, resulting from the repetitive random breakup of drops. High-shear emulsification, the usual way of making emulsions, is therefore found to be very different from turbulent emulsification given by the Kolmogorov-Hinze theory for which power-law distributions of the drop size are expected. In agreement with this, the mean droplet size does not follow a scaling with the Reynolds number of the emulsification flow, but rather a capillary number scaling based on the viscosity of the continuous phase.
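The link between repetitive random breakup and a log-normal size distribution can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' model: each lineage keeps a uniformly random volume fraction per breakup event, and the fraction range, number of events, and function names are all hypothetical. Because the final volume is a product of independent factors, its logarithm is a sum of independent terms and tends towards a normal distribution.

```python
import math
import random
import statistics


def simulate_breakup(n_drops=20000, n_events=30, seed=0):
    """Return log-diameters after repeated random breakup (illustrative only).

    One fragment is followed per initial drop. Each breakup event keeps a
    random fraction of the current volume, so the final volume is a product
    of independent random factors.
    """
    rng = random.Random(seed)
    log_diameters = []
    for _ in range(n_drops):
        log_volume = 0.0  # log of volume relative to the initial drop
        for _ in range(n_events):
            fraction = rng.uniform(0.1, 0.9)  # hypothetical breakup fraction
            log_volume += math.log(fraction)
        log_diameters.append(log_volume / 3.0)  # diameter scales as V**(1/3)
    return log_diameters


def skewness(values):
    """Sample skewness: close to zero for a symmetric (e.g. normal) sample."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


log_d = simulate_breakup()
diameters = [math.exp(x) for x in log_d]
skew_log = skewness(log_d)      # near zero: log-diameters look normal
skew_lin = skewness(diameters)  # strongly positive: long right tail
print(f"skewness of log(d): {skew_log:.3f}, skewness of d: {skew_lin:.3f}")
```

The near-symmetric log-diameters alongside strongly right-skewed diameters are the signature of a log-normal distribution, in contrast to the power-law distributions the abstract associates with Kolmogorov-Hinze turbulent emulsification.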
HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs.
EN: Adapting a language model to a specific domain, a.k.a. 'domain adaptation', is a common practice when specialized knowledge, e.g. medicine, is not encapsulated in a general language model like Llama2. The challenge lies in the heterogeneity of data across the two training stages, as it varies in languages, genres, or formats. To tackle this and simplify the learning protocol, we propose to transform heterogeneous data, from both the pre-training and supervised stages, into a unified, simple input-output pair format. We validate the new protocol in the domains where proprietary LLMs like ChatGPT perform relatively poorly, such as Traditional Chinese Medicine. The developed model, HuatuoGPT-II, has shown state-of-the-art performance in the Chinese medicine domain on a number of benchmarks, e.g. medical licensing exams. It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine. Expert manual evaluations further validate HuatuoGPT-II's advantages over existing LLMs. Notably, HuatuoGPT-II was benchmarked in a fresh Chinese National Medical Licensing Examination where it achieved the best performance, showcasing not only its effe...
Emerging Drug Interaction Prediction Enabled by Flow-based Graph Neural Network with Biomedical Network.
EN: Accurately predicting drug-drug interactions (DDI) for emerging drugs, which offer possibilities for treating and alleviating diseases, with computational methods can improve patient care and contribute to efficient drug development. However, many existing computational methods require large amounts of known DDI information, which is scarce for emerging drugs. In this paper, we propose EmerGNN, a graph neural network (GNN) that can effectively predict interactions for emerging drugs by leveraging the rich information in biomedical networks. EmerGNN learns pairwise representations of drugs by extracting the paths between drug pairs, propagating information from one drug to the other, and incorporating the relevant biomedical concepts on the paths. The different edges on the biomedical network are weighted to indicate the relevance for the target DDI prediction. Overall, EmerGNN has higher accuracy than existing approaches in predicting interactions for emerging drugs and can identify the most relevant information on the biomedical network.
Joint Alignment of Multivariate Quasi-Periodic Functional Data Using Deep Learning.
EN: The joint alignment of multivariate functional data plays an important role in various fields such as signal processing, neuroscience and medicine, including the statistical analysis of data from wearable devices. Traditional methods often ignore the phase variability and instead focus on the variability in the observed amplitude. We present a novel method for joint alignment of multivariate quasi-periodic functions using deep neural networks, decomposing, but retaining all the information in the data by preserving both phase and amplitude variability. Our proposed neural network uses a special activation of the output that builds on the unit simplex transformation, and we utilize a loss function based on the Fisher-Rao metric to train our model. Furthermore, our method is unsupervised and can provide an optimal common template function as well as subject-specific templates. We demonstrate our method on two simulated datasets and one real example, comprising data from 12-lead 10s electrocardiogram recordings.
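One common way to obtain a valid time-warping function from an unconstrained network output is to map the output onto the unit simplex and accumulate it. The sketch below assumes this reading of the "unit simplex transformation" mentioned in the abstract; it is not taken from the paper, and the function names are hypothetical. A softmax gives strictly positive increments that sum to one, and their cumulative sum is a strictly increasing warping function from 0 to 1.

```python
import math
from itertools import accumulate


def softmax(logits):
    """Map real-valued logits to a point on the unit simplex."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def warping_from_logits(logits):
    """Build a discrete warping function gamma on a uniform grid.

    The increments are positive and sum to one, so gamma starts at 0,
    ends at 1, and is strictly increasing (a valid phase function).
    """
    increments = softmax(logits)
    gamma = [0.0] + list(accumulate(increments))
    gamma[-1] = 1.0  # guard against floating-point round-off
    return gamma


identity = warping_from_logits([0.0, 0.0, 0.0, 0.0])   # equal increments
stretched = warping_from_logits([0.0, 1.0, -1.0, 0.5])  # uneven increments
print(identity)
print(stretched)
```

Equal logits reproduce the identity warping, while uneven logits speed up or slow down time locally without ever reversing it.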
ResMGCN: Residual Message Graph Convolution Network for Fast Biomedical Interactions Discovering.
EN: Biomedical information graphs are crucial for discovering interactions among biomedical entities, such as identifying diverse molecular interactions and supporting drug discovery, and they attract increasing interest in the biomedicine, bioinformatics, and human healthcare communities. A growing number of graph neural networks have been proposed to learn representations of biomedical entities and precisely reveal biomedical molecule interactions with state-of-the-art results. These methods remedy the fading of features from distant nodes, but do so at the expensive cost of redundant memory and time. In our paper, we propose a novel Residual Message Graph Convolution Network (ResMGCN) that takes a different approach to fast and precise biomedical interaction prediction. Specifically, instead of enhancing the message from far nodes, ResMGCN aggregates lower-order information with the next round's higher-order information to guide the node update and obtain a more meaningful node representation. ResMGCN is able to perceive and preserve various messages from the previous layer and high-order information in the current layer with the least memory and time cost to obtain...
A Saliency-based Clustering Framework for Identifying Aberrant Predictions.
EN: In machine learning, classification tasks serve as the cornerstone of a wide range of real-world applications. Reliable, trustworthy classification is particularly intricate in biomedical settings, where the ground truth is often inherently uncertain and relies on high degrees of human expertise for labeling. Traditional metrics such as precision and recall, while valuable, are insufficient for capturing the nuances of these ambiguous scenarios. Here we introduce the concept of aberrant predictions, emphasizing that the nature of classification errors is as critical as their frequency. We propose a novel, efficient training methodology aimed at both reducing the misclassification rate and discerning aberrant predictions. Our framework demonstrates a substantial improvement in model performance, achieving a 20% increase in precision. We apply this methodology to the less-explored domain of veterinary radiology, where the stakes are high but have not been as extensively studied compared to human medicine. By focusing on the identification and mitigation of aberrant predictions, we enhance the utility and trustworthiness of machine learning classifiers in high-stakes, real-world scen...
Protein-ligand binding representation learning from fine-grained interactions.
EN: The binding between proteins and ligands plays a crucial role in the realm of drug discovery. Previous deep learning approaches have shown promising results over traditional computationally intensive methods, but generalize poorly due to limited supervised data. In this paper, we propose to learn protein-ligand binding representation in a self-supervised learning manner. Different from existing pre-training approaches which treat proteins and ligands individually, we emphasize discerning the intricate binding patterns from fine-grained interactions. Specifically, this self-supervised learning problem is formulated as a prediction of the conclusive binding complex structure given a pocket and ligand with a Transformer-based interaction module, which naturally emulates the binding process. To ensure the representation of rich binding information, we introduce two pre-training tasks, i.e., atomic pairwise distance map prediction and mask ligand reconstruction, which comprehensively model the fine-grained interactions from both structure and feature space. Extensive experiments have demonstrated the superiority of our method across various binding tasks, including protein...
scBeacon: single-cell biomarker extraction via identifying paired cell clusters across biological conditions with contrastive siamese networks.
EN: Despite the breakthroughs in biomarker discovery facilitated by differential gene analysis, challenges remain, particularly at the single-cell level. Traditional methodologies heavily rely on user-supplied cell annotations, focusing on individually expressed data, often neglecting the critical interactions between biological conditions, such as healthy versus diseased states. In response, here we introduce scBeacon, an innovative framework built upon a deep contrastive siamese network. scBeacon pioneers an unsupervised approach, adeptly identifying matched cell populations across varied conditions, enabling a refined differential gene analysis. By utilizing a VQ-VAE framework, a contrastive siamese network, and a greedy iterative strategy, scBeacon effectively pinpoints differential genes that hold potential as key biomarkers. Comprehensive evaluations on a diverse array of datasets validate scBeacon's superiority over existing single-cell differential gene analysis tools. Its precision and adaptability underscore its significant role in enhancing diagnostic accuracy in biomarker discovery. With the emphasis on the importance of biomarkers in diagnosis, scBeacon is positioned to be...
Crop Disease Classification using Support Vector Machines with Green Chromatic Coordinate (GCC) and Attention based feature extraction for IoT based Smart Agricultural Applications.
EN: Crops hold paramount significance as they serve as the primary provider of energy, nutrition, and medicinal benefits for the human population. Plant diseases, however, can negatively affect leaves during agricultural cultivation, resulting in significant losses in crop output and economic value. Therefore, it is crucial for farmers to identify crop diseases. However, identifying them manually frequently requires hard work, a lot of planning, and in-depth familiarity with plant pathogens. Given these numerous obstacles, it is essential to provide solutions that can easily interface with mobile and IoT devices so that farmers can guarantee the best possible crop development. Various machine learning (ML) as well as deep learning (DL) algorithms have been developed and studied for plant disease detection, yielding substantial and promising results. This article presents a novel classification method that builds on prior work by utilising attention-based feature extraction, RGB channel-based chromatic analysis, Support Vector Machines (SVM) for improved performance, and the ability to integrate with mobile applications and IoT devices after quantization of information. Seve...
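The Green Chromatic Coordinate named in the title is conventionally defined per pixel as GCC = G / (R + G + B). The sketch below computes it for a handful of pixels; how the paper combines GCC with attention features and the SVM is not stated in the truncated abstract, so the function names, the zero-pixel convention, and the sample pixels are assumptions for illustration only.

```python
def green_chromatic_coordinate(r, g, b):
    """Return G / (R + G + B) for one pixel, or 0.0 for a pure black pixel."""
    total = r + g + b
    if total == 0:
        return 0.0  # assumed convention: avoid division by zero
    return g / total


def mean_gcc(pixels):
    """Average GCC over an iterable of (R, G, B) tuples."""
    values = [green_chromatic_coordinate(r, g, b) for r, g, b in pixels]
    return sum(values) / len(values)


# Hypothetical pixels: two greenish leaf pixels and one brownish lesion pixel.
leaf_pixels = [(40, 160, 50), (60, 180, 70), (120, 90, 60)]
print(f"mean GCC: {mean_gcc(leaf_pixels):.3f}")
```

A grey pixel gives a GCC of one third, so values well above that indicate green-dominated regions. A single scalar like this is cheap enough to compute on constrained IoT hardware.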
Delta Score: Improving the Binding Assessment of Structure-Based Drug Design Methods.
EN: Structure-based drug design (SBDD) stands at the forefront of drug discovery, emphasizing the creation of molecules that target specific binding pockets. Recent advances in this area have witnessed the adoption of deep generative models and geometric deep learning techniques, modeling SBDD as a conditional generation task where the target structure serves as context. Historically, evaluation of these models centered on docking scores, which quantitatively depict the predicted binding affinity between a molecule and its target pocket. Though state-of-the-art models purport that a majority of their generated ligands exceed the docking score of ground truth ligands in test sets, it begs the question: Do these scores align with real-world biological needs? In this paper, we introduce the delta score, a novel evaluation metric grounded in tangible pharmaceutical requisites. Our experiments reveal that molecules produced by current deep generative models significantly lag behind ground truth reference ligands when assessed with the delta score. This novel metric not only complements existing benchmarks but also provides a pivotal direction for subsequent research in the domain.
Making informed decisions in cutting tool maintenance in milling: A KNN-based model agnostic approach.
EN: Tool Condition Monitoring (TCM) is vital for maintaining productivity and product quality in machining. This study leverages machine learning to analyze real-time force signals collected from experiments under various tool wear conditions. Statistical analysis and feature selection using decision trees were followed by classification using a K-Nearest Neighbors (KNN) algorithm, with hyperparameter tuning to enhance performance. While machine learning has been widely applied in TCM, interpretability remains limited. This work introduces a KNN-based white-box model that enhances transparency in decision-making by revealing how features influence classification. The model not only detects tool wear but also provides insights into the reasoning behind each decision, enabling manufacturers to make informed maintenance choices.
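The transparency argument above can be illustrated with a minimal K-Nearest Neighbors sketch that returns the neighbours behind each prediction along with the label, so a user can inspect why a tool was flagged. The two features (mean force, force standard deviation), the labels, and all numeric values are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Returns the predicted label together with the (distance, label) pairs of
    the neighbours that produced it, which serves as the explanation.
    """
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    neighbours = distances[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0], neighbours


if __name__ == "__main__":
    # Hypothetical (mean force, force std) features per cutting pass.
    train_X = [(10.0, 1.0), (11.0, 1.2), (10.5, 0.9),
               (20.0, 3.0), (21.0, 3.5), (19.5, 2.8)]
    train_y = ["sharp", "sharp", "sharp", "worn", "worn", "worn"]
    label, evidence = knn_predict(train_X, train_y, (19.0, 2.9), k=3)
    print(label, evidence)
```

In practice the features would usually be scaled before distances are computed, since Euclidean distance is sensitive to the units of each feature.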
UniMAP: Universal SMILES-Graph Representation Learning.
EN: Molecular representation learning is fundamental for many drug-related applications. Most existing molecular pre-training models are limited to a single molecular modality, either SMILES or graph representation. To effectively leverage both modalities, we argue that it is critical to capture the fine-grained 'semantics' between SMILES and graph, because subtle sequence/graph differences may lead to contrary molecular properties. In this paper, we propose a universal SMILES-graph representation learning model, namely UniMAP. Firstly, an embedding layer is employed to obtain the token and node/edge representation in SMILES and graph, respectively. A multi-layer Transformer is then utilized to conduct deep cross-modality fusion. Specifically, four kinds of pre-training tasks are designed for UniMAP, including Multi-Level Cross-Modality Masking (CMM), SMILES-Graph Matching (SGM), Fragment-Level Alignment (FLA), and Domain Knowledge Learning (DKL). In this way, both global (i.e. SGM and DKL) and local (i.e. CMM and FLA) alignments are integrated to achieve comprehensive cross-modality fusion. We evaluate UniMAP on various downstream tasks, i.e. molecular property prediction, drug-target...
Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models.
EN: Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Recent advances have incorporated deep learning techniques to improve the accuracy of protein-ligand structure prediction. Nevertheless, because the experimental validation of docking conformations remains costly, training data are limited, which raises concerns regarding the generalizability of these deep learning-based methods. In this work, we show that by pre-training on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes, we can obtain a protein-ligand structure prediction model with outstanding performance. Specifically, this process involved the generation of 100 million docking conformations for protein-ligand pairings, an endeavor consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase. HelixDock has been rigorously benchmarked against both p...
Named Entity Recognition for Monitoring Plant Health Threats in Tweets: a ChouBERT Approach.
EN: An important application scenario of precision agriculture is detecting and measuring crop health threats using sensors and data analysis techniques. However, the textual data are still under-explored among the existing solutions due to the lack of labelled data and fine-grained semantic resources. Recent research suggests that the increasing connectivity of farmers and the emergence of online farming communities make social media like Twitter a participatory platform for detecting unfamiliar plant health events if we can extract essential information from unstructured textual data. ChouBERT is a French pre-trained language model that can identify Tweets concerning observations of plant health issues with generalizability on unseen natural hazards. This paper tackles the lack of labelled data by further studying ChouBERT's know-how on token-level annotation tasks over small labeled sets.
Prebiotic Vitamin B$_3$ Synthesis in Carbonaceous Planetesimals.
EN: Aqueous chemistry within carbonaceous planetesimals is promising for synthesizing prebiotic organic matter essential to all life. Meteorites derived from these planetesimals delivered these life building blocks to the early Earth, potentially facilitating the origins of life. Here, we studied the formation of vitamin B$_3$ as it is an important precursor of the coenzyme NAD(P)(H), which is essential for the metabolism of all life as we know it. We propose a new reaction mechanism based on known experiments in the literature that explains the synthesis of vitamin B$_3$. It combines the sugar precursors glyceraldehyde or dihydroxyacetone with the amino acids aspartic acid or asparagine in aqueous solution without oxygen or other oxidizing agents. We performed thermochemical equilibrium calculations to test the thermodynamic favorability. The predicted vitamin B$_3$ abundances resulting from this new pathway were compared with measured values in asteroids and meteorites. We conclude that competition for reactants and decomposition by hydrolysis are necessary to explain the prebiotic content of meteorites. In sum, our model fits well into the complex network of chemical pathways active...
GENOT: Entropic (Gromov) Wasserstein Flow Matching with Applications to Single-Cell Genomics.
EN: Single-cell genomics has significantly advanced our understanding of cellular behavior, catalyzing innovations in treatments and precision medicine. However, single-cell sequencing technologies are inherently destructive and can only measure a limited array of data modalities simultaneously. This limitation underscores the need for new methods capable of realigning cells. Optimal transport (OT) has emerged as a potent solution, but traditional discrete solvers are hampered by scalability, privacy, and out-of-sample estimation issues. These challenges have spurred the development of neural network-based solvers, known as neural OT solvers, that parameterize OT maps. Yet, these models often lack the flexibility needed for broader life science applications. To address these deficiencies, our approach learns stochastic maps (i.e. transport plans), allows for any cost function, relaxes mass conservation constraints and integrates quadratic solvers to tackle the complex challenges posed by the (Fused) Gromov-Wasserstein problem. Utilizing flow matching as a backbone, our method offers a flexible and effective framework. We demonstrate its versatility and robustness through applications i...
ETDock: A Novel Equivariant Transformer for Protein-Ligand Docking.
EN: Predicting the docking between proteins and ligands is a crucial and challenging task for drug discovery. However, traditional docking methods mainly rely on scoring functions, and deep learning-based docking approaches usually neglect the 3D spatial information of proteins and ligands, as well as the graph-level features of ligands, which limits their performance. To address these limitations, we propose an equivariant transformer neural network for protein-ligand docking pose prediction. Our approach involves the fusion of ligand graph-level features by feature processing, followed by the learning of ligand and protein representations using our proposed TAMformer module. Additionally, we employ an iterative optimization approach based on the predicted distance matrix to generate refined ligand poses. The experimental results on real datasets show that our model can achieve state-of-the-art performance.
Adsorption of fragrance capsules onto cellulose nano- and micro-cellulose fibers in presence of guar biopolymers.
EN: Fabric softeners are formulated to enhance textile softness and impart a pleasant scent. One of the most efficient technologies for controlled fragrance delivery onto fabrics involves encapsulating scent molecules in polymer capsules. Here, we investigate the adsorption of anionic fragrance capsules on cotton fabrics with the goal of reducing the reliance on palm-oil-derived surfactants. First, we employ 200 nm-cellulose nanocrystals (CNC) as a reliable model for cotton fibers. CNC enables us to explore interactions among various softener components, including surfactants, guar biopolymers, and fragrances, using physical chemistry techniques applied to bulk dispersions. The primary objective is to elucidate the role of surfactant vesicles, the primary ingredient in textile conditioners, in the association between fragrance capsules and cotton. Secondly, we examine the influence of biopolymers present in a newly developed, environmentally friendly softener on this association. Our findings demonstrate that anionic fragrance capsules are deposited onto cotton microfibers in the presence of either cationic surfactants or guar biopolymers, driven by electrostatic interactions. Scann...
FABind: Fast and Accurate Protein-Ligand Binding.
EN: Modeling the interaction between proteins and ligands and accurately predicting their binding structures is a critical yet challenging task in drug discovery. Recent advancements in deep learning have shown promise in addressing this challenge, with sampling-based and regression-based methods emerging as two prominent approaches. However, these methods have notable limitations. Sampling-based methods often suffer from low efficiency due to the need for generating multiple candidate structures for selection. On the other hand, regression-based methods offer fast predictions but may experience decreased accuracy. Additionally, the variation in protein sizes often requires external modules for selecting suitable binding pockets, further impacting efficiency. In this work, we propose $\mathbf{FABind}$, an end-to-end model that combines pocket prediction and docking to achieve accurate and fast protein-ligand binding. $\mathbf{FABind}$ incorporates a unique ligand-informed pocket prediction module, which is also leveraged for docking pose estimation. The model further enhances the docking process by incrementally integrating the predicted pocket to optimize protein-ligand binding, reduc...
DockGame: Cooperative Games for Multimeric Rigid Protein Docking.
EN: Protein interactions and assembly formation are fundamental to most biological processes. Predicting the assembly structure from constituent proteins -- referred to as the protein docking task -- is thus a crucial step in protein design applications. Most traditional and deep learning methods for docking have focused mainly on binary docking, following either a search-based, regression-based, or generative modeling paradigm. In this paper, we focus on the less-studied multimeric (i.e., two or more proteins) docking problem. We introduce DockGame, a novel game-theoretic framework for docking -- we view protein docking as a cooperative game between proteins, where the final assembly structure(s) constitute stable equilibria w.r.t. the underlying game potential. Since we do not have access to the true potential, we consider two approaches - i) learning a surrogate game potential guided by physics-based energy functions and computing equilibria by simultaneous gradient updates, and ii) sampling from the Gibbs distribution of the true potential by learning a diffusion generative model over the action spaces (rotations and translations) of all proteins. Empirically, on the Docking Benchm...
Harmonic Self-Conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design.
EN: A significant amount of protein function requires binding small molecules, including enzymatic catalysis. As such, designing binding pockets for small molecules has several impactful applications ranging from drug synthesis to energy storage. Towards this goal, we first develop HarmonicFlow, an improved generative process over 3D protein-ligand binding structures based on our self-conditioned flow matching objective. FlowSite extends this flow model to jointly generate a protein pocket's discrete residue types and the molecule's binding 3D structure. We show that HarmonicFlow improves upon state-of-the-art generative processes for docking in simplicity, generality, and average sample quality in pocket-level docking. Enabled by this structure modeling, FlowSite designs binding sites substantially better than baseline approaches.
Know2BIO: A Comprehensive Dual-View Benchmark for Evolving Biomedical Knowledge Graphs.
EN: Knowledge graphs (KGs) have emerged as a powerful framework for representing and integrating complex biomedical information. However, assembling KGs from diverse sources remains a significant challenge in several aspects, including entity alignment, scalability, and the need for continuous updates to keep pace with scientific advancements. Moreover, the representative power of KGs is often limited by the scarcity of multi-modal data integration. To overcome these challenges, we propose Know2BIO, a general-purpose heterogeneous KG benchmark for the biomedical domain. Know2BIO integrates data from 30 diverse sources, capturing intricate relationships across 11 biomedical categories. It currently consists of ~219,000 nodes and ~6,200,000 edges. Know2BIO is capable of user-directed automated updating to reflect the latest knowledge in biomedical science. Furthermore, Know2BIO is accompanied by multi-modal data: node features including text descriptions, protein and compound sequences and structures, enabling the utilization of emerging natural language processing methods and multi-modal data integration strategies. We evaluate KG representation models on Know2BIO, demonstrating its eff...
De Novo Drug Design with Joint Transformers.
EN: De novo drug design requires simultaneously generating novel molecules outside of training data and predicting their target properties, making it a hard task for generative models. To address this, we propose Joint Transformer that combines a Transformer decoder, Transformer encoder, and a predictor in a joint generative model with shared weights. We formulate a probabilistic black-box optimization algorithm that employs Joint Transformer to generate novel molecules with improved target properties and outperforms other SMILES-based optimization methods in de novo drug design.
Overcoming the Barrier of Orbital-Free Density Functional Theory for Molecular Systems Using Deep Learning.
EN: Orbital-free density functional theory (OFDFT) is a quantum chemistry formulation that has a lower cost scaling than the prevailing Kohn-Sham DFT, which is increasingly desired for contemporary molecular research. However, its accuracy is limited by the kinetic energy density functional, which is notoriously hard to approximate for non-periodic molecular systems. Here we propose M-OFDFT, an OFDFT approach capable of solving molecular systems using a deep learning functional model. We build the essential non-locality into the model, which is made affordable by the concise density representation as expansion coefficients under an atomic basis. With techniques to address unconventional learning challenges therein, M-OFDFT achieves a comparable accuracy with Kohn-Sham DFT on a wide range of molecules untouched by OFDFT before. More attractively, M-OFDFT extrapolates well to molecules much larger than those seen in training, which unleashes the appealing scaling of OFDFT for studying large molecules including proteins, representing an advancement of the accuracy-efficiency trade-off frontier in quantum chemistry.
Language models in molecular discovery.
EN: The success of language models, especially transformer-based architectures, has trickled into other domains giving rise to "scientific language models" that operate on small molecules, proteins or polymers. In chemistry, language models contribute to accelerating the molecule discovery cycle as evidenced by promising recent findings in early-stage drug discovery. Here, we review the role of language models in molecular discovery, underlining their strength in de novo drug design, property prediction and reaction chemistry. We highlight valuable open-source software assets thus lowering the entry barrier to the field of scientific language modeling. Last, we sketch a vision for future molecular design that combines a chatbot interface with access to computational chemistry tools. Our contribution serves as a valuable resource for researchers, chemists, and AI enthusiasts interested in understanding how language models can and will be used to accelerate chemical discovery.
Concentration Dependence of Elastic and Viscoelastic Properties of Aqueous Solutions of Ficoll and Bovine Serum Albumin by Brillouin Light Scattering Spectroscopy.
EN: The cellular environment is crowded with macromolecules of different shapes and sizes. The effect of this macromolecular crowding has been studied in a variety of synthetic crowding environments: two popular examples are the compact colloid-like Ficoll macromolecule, and the globular protein bovine serum albumin (BSA). Recent studies have indicated a significant component of bound or surface-associated water in these crowders reduces the available free volume. In this work, Brillouin light scattering experiments were performed on aqueous solutions of Ficoll 70 and Ficoll 400 with concentrations ranging from 1 wt% to 35 wt% and BSA with concentrations of 1 wt% to 27 wt%. From the dependence of spectral peak parameters on polymer concentration, we determined fundamental solution properties: hypersound velocity, adiabatic bulk modulus and compressibility, apparent viscosity, and hypersound attenuation. Existing theory that ignores intermolecular interactions can only capture the observed linear trends in the frequency shift up to a threshold concentration, beyond which a quadratic term accounting for intermolecular interactions is necessary. This likely indicates a transition from the...
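The solution properties listed above follow from the measured Brillouin frequency shift through standard relations: the hypersound velocity is v = f_B * lambda / (2 n sin(theta / 2)), the adiabatic bulk modulus is K = rho * v^2, and the adiabatic compressibility is 1 / K. The sketch below applies them. The abstract does not give the laser wavelength, scattering angle, refractive index, or density, so the numbers in the usage example are illustrative water-like assumptions and not values from the study.

```python
import math


def hypersound_velocity(freq_shift_hz, wavelength_m, refractive_index,
                        scattering_angle_deg=180.0):
    """Hypersound velocity (m/s) from the Brillouin frequency shift.

    Backscattering (180 degrees) is assumed by default; this is an
    assumption of the sketch, not a detail stated in the abstract.
    """
    half_angle = math.radians(scattering_angle_deg) / 2.0
    return freq_shift_hz * wavelength_m / (
        2.0 * refractive_index * math.sin(half_angle)
    )


def adiabatic_bulk_modulus(density_kg_m3, velocity_m_s):
    """Adiabatic bulk modulus K = rho * v^2, in pascals."""
    return density_kg_m3 * velocity_m_s ** 2


def adiabatic_compressibility(density_kg_m3, velocity_m_s):
    """Adiabatic compressibility 1 / K, in 1/Pa."""
    return 1.0 / adiabatic_bulk_modulus(density_kg_m3, velocity_m_s)


if __name__ == "__main__":
    # Illustrative values: 7.5 GHz shift, 532 nm laser, n = 1.33, 1000 kg/m^3.
    v = hypersound_velocity(7.5e9, 532e-9, 1.33)
    print(v, adiabatic_bulk_modulus(1000.0, v))
```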
Evaluation of GPT-3 for Anti-Cancer Drug Sensitivity Prediction.
EN: In this study, we investigated the potential of GPT-3 for the anti-cancer drug sensitivity prediction task using structured pharmacogenomics data across five tissue types and evaluated its performance with zero-shot prompting and fine-tuning paradigms. The drug's SMILES representation and the cell line's genomic mutation features were predictive of the drug response. The results from this study have the potential to pave the way for designing more efficient treatment protocols in precision oncology.
Knowledge Distillation-Empowered Digital Twin for Anomaly Detection.
EN: Cyber-physical systems (CPSs), like train control and management systems (TCMS), are becoming ubiquitous in critical infrastructures. As safety-critical systems, ensuring their dependability during operation is crucial. Digital twins (DTs) have been increasingly studied for this purpose owing to their capability of runtime monitoring and warning, prediction and detection of anomalies, etc. However, constructing a DT for anomaly detection in TCMS necessitates sufficient training data and extracting both chronological and context features with high quality. Hence, in this paper, we propose a novel method named KDDT for TCMS anomaly detection. KDDT harnesses a language model (LM) and a long short-term memory (LSTM) network to extract contexts and chronological features, respectively. To enrich data volume, KDDT benefits from out-of-domain data with knowledge distillation (KD). We evaluated KDDT with two datasets from our industry partner Alstom and obtained the F1 scores of 0.931 and 0.915, respectively, demonstrating the effectiveness of KDDT. We also explored individual contributions of the DT model, LM, and KD to the overall performance of KDDT, via a comprehensive empirical study,...
An automated, high-resolution phenotypic assay for adult Brugia malayi and microfilaria.
EN: Brugia malayi are thread-like parasitic worms and one of the etiological agents of lymphatic filariasis (LF). Existing anthelmintic drugs to treat LF are effective in reducing the larval microfilaria (mf) counts in the human bloodstream but are less effective on adult parasites. To test potential drug candidates, we report a multi-parameter phenotypic assay based on tracking the motility of adult B. malayi and mf in vitro. For adult B. malayi, motility is characterized by the centroid velocity, path curvature, angular velocity, eccentricity, extent, and Euler Number. These parameters are evaluated in experiments with three anthelmintic drugs. For B. malayi mf, motility is extracted from the evolving body skeleton to yield positional data and bending angles at 74 key points. We achieved high-fidelity tracking of complex worm postures (self-occlusions, omega turns, body bending, and reversals) while providing a visual representation of pose estimates and behavioral attributes in both space and time scales.
ReOnto: A Neuro-Symbolic Approach for Biomedical Relation Extraction.
EN: Relation Extraction (RE) is the task of extracting semantic relationships between entities in a sentence and aligning them to relations defined in a vocabulary, which is generally in the form of a Knowledge Graph (KG) or an ontology. Various approaches have been proposed so far to address this task. However, applying these techniques to biomedical text often yields unsatisfactory results because it is hard to infer relations directly from sentences due to the nature of the biomedical relations. To address these issues, we present a novel technique called ReOnto that makes use of neuro-symbolic knowledge for the RE task. ReOnto employs a graph neural network to acquire the sentence representation and leverages publicly accessible ontologies as prior knowledge to identify the sentential relation between two entities. The approach involves extracting the relation path between the two entities from the ontology. We evaluate the effect of using symbolic knowledge from ontologies with graph neural networks. Experimental results on two public biomedical datasets, BioRel and ADE, show that our method outperforms all the baselines (by approximately 3%).
Learning a Patent-Informed Biomedical Knowledge Graph Reveals Technological Potential of Drug Repositioning Candidates.
EN: Drug repositioning, a promising strategy for discovering new therapeutic uses for existing drugs, has been increasingly explored in the computational science literature using biomedical databases. However, the technological potential of drug repositioning candidates has often been overlooked. This study presents a novel protocol to comprehensively analyse various sources such as pharmaceutical patents and biomedical databases, and identify drug repositioning candidates with both technological potential and scientific evidence. To this end, first, we constructed a scientific biomedical knowledge graph (s-BKG) comprising relationships between drugs, diseases, and genes derived from biomedical databases. Our protocol involves identifying drugs that exhibit limited association with the target disease but are closely located in the s-BKG, as potential drug candidates. We constructed a patent-informed biomedical knowledge graph (p-BKG) by adding pharmaceutical patent information. Finally, we developed a graph embedding protocol to ascertain the structure of the p-BKG, thereby calculating the relevance scores of those candidates with target disease-related patents to evaluate their technolo...
Towards Hierarchical Regional Transformer-based Multiple Instance Learning.
EN: The classification of gigapixel histopathology images with deep multiple instance learning models has become a critical task in digital pathology and precision medicine. In this work, we propose a Transformer-based multiple instance learning approach that replaces the traditional learned attention mechanism with a regional, Vision Transformer-inspired self-attention mechanism. We present a method that fuses regional patch information to derive slide-level predictions and show how this regional aggregation can be stacked to hierarchically process features on different distance levels. To increase predictive accuracy, especially for datasets with small, local morphological features, we introduce a method to focus the image processing on high-attention regions during inference. Our approach significantly improves performance over the baseline on two histopathology datasets and points towards promising directions for further research.
Shape-conditioned 3D Molecule Generation via Equivariant Diffusion Models.
EN: Ligand-based drug design aims to identify novel drug candidates with shapes similar to those of known active molecules. In this paper, we formulated an in silico shape-conditioned molecule generation problem to generate 3D molecule structures conditioned on the shape of a given molecule. To address this problem, we developed a translation- and rotation-equivariant shape-guided generative model, ShapeMol. ShapeMol consists of an equivariant shape encoder that maps molecular surface shapes into latent embeddings, and an equivariant diffusion model that generates 3D molecules based on these embeddings. Experimental results show that ShapeMol can generate novel, diverse, drug-like molecules that retain 3D molecular shapes similar to the given shape condition. These results demonstrate the potential of ShapeMol in designing drug candidates of desired 3D shapes binding to protein target pockets.
Revisiting Skin Tone Fairness in Dermatological Lesion Classification.
EN: Addressing fairness in lesion classification from dermatological images is crucial due to variations in how skin diseases manifest across skin tones. However, the absence of skin tone labels in public datasets hinders building a fair classifier. To date, such skin tone labels have been estimated prior to fairness analysis in independent studies using the Individual Typology Angle (ITA). Briefly, ITA calculates an angle based on pixels extracted from skin images taking into account the lightness and yellow-blue tints. These angles are then categorised into skin tones that are subsequently used to analyse fairness in skin cancer classification. In this work, we review and compare four ITA-based approaches of skin tone classification on the ISIC18 dataset, a common benchmark for assessing skin cancer classification fairness in the literature. Our analyses reveal a high disagreement among previously published studies demonstrating the risks of ITA-based skin tone estimation methods. Moreover, we investigate the causes of such large discrepancy among these approaches and find that the lack of diversity in the ISIC18 dataset limits its use as a testbed for fairness analysis. Finally, we ...
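The ITA computation described above can be sketched as follows, using the standard definition ITA = arctan((L* - 50) / b*) x 180 / pi on CIELAB lightness L* and yellow-blue component b*. The category thresholds in the code are one commonly cited set and are an assumption here; the approaches compared in the paper may use different cut-offs and different pixel-selection rules, which is part of why they disagree.

```python
import math


def ita_degrees(lightness, b_star):
    """Individual Typology Angle in degrees from CIELAB L* and b*.

    atan2 matches arctan((L* - 50) / b*) for b* > 0 and avoids a
    division by zero when b* == 0.
    """
    return math.degrees(math.atan2(lightness - 50.0, b_star))


# Assumed thresholds (degrees); not taken from the paper.
ITA_CATEGORIES = [
    (55.0, "very light"),
    (41.0, "light"),
    (28.0, "intermediate"),
    (10.0, "tan"),
    (-30.0, "brown"),
]


def skin_tone_category(ita):
    """Map an ITA value to a coarse skin tone label."""
    for lower_bound, label in ITA_CATEGORIES:
        if ita > lower_bound:
            return label
    return "dark"


angle = ita_degrees(70.0, 15.0)
print(round(angle, 2), skin_tone_category(angle))
```

In practice L* and b* would be aggregated (for example, averaged) over pixels judged to be healthy skin; that selection step is not shown and differs between studies.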
Benchmarking Generated Poses: How Rational is Structure-based Drug Design with Generative Models?
EN: Deep generative models for structure-based drug design (SBDD), where molecule generation is conditioned on a 3D protein pocket, have received considerable interest in recent years. These methods offer the promise of higher-quality molecule generation by explicitly modelling the 3D interaction between a potential drug and a protein receptor. However, previous work has primarily focused on the quality of the generated molecules themselves, with limited evaluation of the 3D molecule poses that these methods produce. Most work simply discards the generated pose and only reports a "corrected" pose after redocking with traditional methods. Little is known about whether generated molecules satisfy known physical constraints for binding and the extent to which redocking alters the generated interactions. We introduce PoseCheck, an extensive analysis of multiple state-of-the-art methods and find that generated molecules have significantly more physical violations and fewer key interactions compared to baselines, calling into question the implicit assumption that providing rich 3D structure information improves molecule complementarity. We make recommendations for future rese...
ChatGPT in Drug Discovery: A Case Study on Anti-Cocaine Addiction Drug Development with Chatbots.
EN: The birth of ChatGPT, a cutting-edge language model-based chatbot developed by OpenAI, ushered in a new era in AI. However, due to potential pitfalls, its role in rigorous scientific research is not clear yet. This paper vividly showcases its innovative application within the field of drug discovery. Focused specifically on developing anti-cocaine addiction drugs, the study employs GPT-4 as a virtual guide, offering strategic and methodological insights to researchers working on generative models for drug candidates. The primary objective is to generate optimal drug-like molecules with desired properties. By leveraging the capabilities of ChatGPT, the study introduces a novel approach to the drug discovery process. This symbiotic partnership between AI and researchers transforms how drug development is approached. Chatbots become facilitators, steering researchers towards innovative methodologies and productive paths for creating effective drug candidates. This research sheds light on the collaborative synergy between human expertise and AI assistance, wherein ChatGPT's cognitive abilities enhance the design and development of potential pharmaceutical solutions. This paper not only...
Aspects of the microscopic structure of curcumin solutions with water-dimethylsulfoxide solvent. Molecular dynamics computer simulation study.
EN: We explore some aspects of the microscopic structure of curcumin solutions with water-dimethylsulfoxide solvent of variable composition. Molecular dynamics computer simulations at isobaric-isothermal conditions are used for this purpose. The model consists of the OPLS-UA type model for the enol conformer of curcumin (J. Mol. Liq., 223, 707, 2016), the OPLS model for dimethylsulfoxide (DMSO) and the SPC/E water model. Radial distributions for the centers of mass of curcumin molecules are evaluated and the corresponding running coordination numbers are analyzed. The disaggregation of curcumin clusters upon increasing the DMSO content in water-DMSO solvent is elucidated. Changes in the distribution of water and DMSO species around curcumin molecules are investigated. A qualitative comparison of our findings with the results of other authors is performed. The possibility of relating the model's predictions to experimental observations in terms of the so-called critical water aggregation percentage is discussed.
Liquid Metal Molecular Scissors.
EN: Molecules are the smallest units of matter that can exist independently, remain relatively stable, and maintain physical and chemical activities. The atomic species, alignment commands, and chemical bonds are the key factors that determine their structures and properties. Here we disclose a general chemistry effect: liquid metals can directly cut off oxygen-containing groups in various molecular matters at room temperature, and then recombine the remaining groups to form functional materials including nano semiconductors. Based on this unique mechanism, we propose a basic tool, named liquid metal scissors, for molecular directional clipping and functional transformation. As a proof of concept, we demonstrate the capabilities of eGaIn scissors made of Ga and In particles, and reveal that the Ga on the surface of eGaIn can directly snatch oxygen atoms from various targeted substances such as H2O, CO2 or CH3OH molecules to form gallium oxides. As an illustration, after clipping, the remaining hydrogen atoms of H2O molecules recombine to form H2, while the remaining groups of CH3OH lead to H2, carbon quantum dots, and other related substances. If needed, more molecules can also be...
Molecular docking via quantum approximate optimization algorithm.
EN: Molecular docking plays a pivotal role in drug discovery and precision medicine, enabling us to understand protein functions and advance novel therapeutics. Here, we introduce a potential alternative solution to this problem, the digitized-counterdiabatic quantum approximate optimization algorithm (DC-QAOA), which utilizes counterdiabatic driving and QAOA on a quantum computer. Our method was applied to analyze diverse biological systems, including the SARS-CoV-2 Mpro complex with PM-2-020B, the DPP-4 complex with piperidine fused imidazopyridine 34, and the HIV-1 gp120 complex with JP-III-048. The DC-QAOA exhibits superior performance, providing more accurate and biologically relevant docking results, especially for larger molecular docking problems. Moreover, QAOA-based algorithms demonstrate enhanced hardware compatibility in the noisy intermediate-scale quantum era, indicating their potential for efficient implementation under practical docking scenarios. Our findings underscore quantum computing's potential in drug discovery and offer valuable insights for optimizing protein-ligand docking processes.
Target-aware Variational Auto-encoders for Ligand Generation with Multimodal Protein Representation Learning.
EN: Without knowledge of specific pockets, generating ligands based on the global structure of a protein target plays a crucial role in drug discovery as it helps reduce the search space for potential drug-like candidates in the pipeline. However, contemporary methods require optimizing tailored networks for each protein, which is arduous and costly. To address this issue, we introduce TargetVAE, a target-aware variational auto-encoder that generates ligands with high binding affinities to arbitrary protein targets, guided by a novel multimodal deep neural network built based on graph Transformers as the prior for the generative model. This is the first effort to unify different representations of proteins (e.g., sequence of amino-acids, 3D structure) into a single model that we name as Protein Multimodal Network (PMN). Our multimodal architecture learns from the entire protein structures and is able to capture their sequential, topological and geometrical information. We showcase the superiority of our approach by conducting extensive experiments and evaluations, including the assessment of generative model quality, ligand generation for unseen targets, docking score computation, and ...
Nanocellulose-stabilized Pickering emulsions: fabrication, stabilization, and food applications.
EN: Pickering emulsions have been widely studied due to their good stability and potential applications. Nanocellulose, including cellulose nanocrystals (CNCs), cellulose nanofibrils (CNFs), and bacterial cellulose nanofibrils (BCNFs), has emerged as a sustainable stabilizer/emulsifier in food-related Pickering emulsions due to its favorable properties such as renewability, low toxicity, amphiphilicity, biocompatibility, and high aspect ratio. Nanocellulose can be obtained from a wide range of sources and extraction methods and can effectively stabilize Pickering emulsions via irreversible adsorption onto the oil-water interface. The synergistic effects of nanocellulose and other substances can further enhance the interfacial networks. The nanocellulose-based Pickering emulsions have potential food-related applications in delivery systems, food packaging materials, and fat substitutes. In this review, we highlight key fundamental work and recent reports on nanocellulose-based Pickering emulsion systems. The sources and extraction of nanocellulose and the fabrication of nanocellulose-based Pickering emulsions are briefly summarized. Furthermore, the synergistic stability and food-relat...
Lagrangian statistics of dense emulsions.
EN: The dynamics of dense stabilized emulsions presents a rich phenomenology including chaotic emulsification, non-Newtonian rheology and ageing dynamics at rest. Macroscopic rheology results from the complex droplet microdynamics and, in turn, droplet dynamics is influenced by macroscopic flows via the competing action of hydrodynamic and interfacial stresses, giving rise to a complex tangle of elastoplastic effects, diffusion, breakups and coalescence events. This tight multiscale coupling, together with the daunting challenge of experimentally investigating droplets under flow, hindered the understanding of dense emulsions dynamics. We present results from 3D numerical simulations of dense stabilised emulsions, resolving the shape and dynamics of individual droplets, along with the macroscopic flows. We investigate droplet dispersion statistics, measuring probability density functions (PDF) of droplet displacements and velocities, changing the concentration, in the stirred and ageing regimes. We provide the first measurements ever, in concentrated emulsions, of the relative droplet-droplet separations PDF and of the droplet acceleration PDF, which becomes strongly non-Gaussian as th...
BovineTalk: Machine Learning for Vocalization Analysis of Dairy Cattle under Negative Affective States.
EN: There is a critical need to develop and validate non-invasive animal-based indicators of affective states in livestock species, in order to integrate them into on-farm assessment protocols, potentially via the use of precision livestock farming (PLF) tools. One such promising approach is the use of vocal indicators. The acoustic structure of vocalizations and their functions have been extensively studied in important livestock species, such as pigs, horses, poultry and goats, yet cattle remain understudied in this context to date. Cows have been shown to produce two types of vocalizations: low-frequency calls (LF), produced with the mouth closed or partially closed for close-distance contact, and high-frequency calls (HF), emitted with the mouth open for long-distance communication, with the latter considered to be largely associated with negative affective states. Moreover, cattle vocalizations were shown to contain information on individuality across a wide range of contexts, both negative and positive. Nowadays, dairy cows are facing a series of negative challenges and stressors in a typical production cycle, making vocalizations during negative affective states of special interest for res...
Current Methods for Drug Property Prediction in the Real World.
EN: Predicting drug properties is key in drug discovery to enable de-risking of assets before expensive clinical trials, and to find highly active compounds faster. Interest from the Machine Learning community has led to the release of a variety of benchmark datasets and proposed methods. However, it remains unclear for practitioners which method or approach is most suitable, as different papers benchmark on different datasets and methods, leading to varying conclusions that are not easily compared. Our large-scale empirical study links together numerous earlier works on different datasets and methods; thus offering a comprehensive overview of the existing property classes, datasets, and their interactions with different methods. We emphasise the importance of uncertainty quantification and the time and therefore cost of applying these methods in the drug development decision-making cycle. We discover that the best method depends on the dataset, and that engineered features with classical ML methods often outperform deep learning. Specifically, QSAR datasets are typically best analysed with classical methods such as Gaussian Processes while ADMET datasets are sometimes better described...
Microbial Engineering to Mitigate Methane Emissions in Ruminant Livestock -- A Review.
EN: The most recent and promising strategies for mitigating methane emissions in ruminants are reviewed, highlighting the potential of reductive acetogenesis as a viable alternative to methanogenesis. The emergence of Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) technology, and its exceptional precision in genome editing, further enhances the prospects of exploring this avenue. Indeed, research in ruminant methane mitigation has been extensive, and over the years has resulted in the development of a wide variety of mitigation strategies. There is no doubt that the concepts of meat alternatives like lab-meat, microbial proteins and plant proteins may produce equivalent emissions. Reducing methane intensity through breeding and diet has been limited by our inability to phenotype ruminants in a high-throughput manner and the intensification of feed-food competition. Although chemical inhibitors have demonstrated effectiveness in manipulating the rumen microbiota to reduce net emissions, their success is constrained in terms of duration and feasibility in grazing systems. Progress in making acetogenesis the dominant hydrogen sink in the rumen has been hampered by the th...
Assessing Intra-class Diversity and Quality of Synthetically Generated Images in a Biomedical and Non-biomedical Setting.
EN: In biomedical image analysis, data imbalance is common across several imaging modalities. Data augmentation is one of the key solutions in addressing this limitation. Generative Adversarial Networks (GANs) are increasingly being relied upon for data augmentation tasks. Biomedical image features are sensitive to evaluating the efficacy of synthetic images. These features can have a significant impact on metric scores when evaluating synthetic images across different biomedical imaging modalities. Synthetically generated images can be evaluated by comparing the diversity and quality of real images. Multi-scale Structural Similarity Index Measure and Cosine Distance are used to evaluate intra-class diversity, while Frechet Inception Distance is used to evaluate the quality of synthetic images. Assessing these metrics for biomedical and non-biomedical imaging is important to investigate an informed strategy in evaluating the diversity and quality of synthetic images. In this work, an empirical assessment of these metrics is conducted for the Deep Convolutional GAN in a biomedical and non-biomedical setting. The diversity and quality of synthetic images are evaluated using different sam...
Using simulation to calibrate real data acquisition in veterinary medicine.
EN: This paper explores the innovative use of simulation environments to enhance data acquisition and diagnostics in veterinary medicine, focusing specifically on gait analysis in dogs. The study harnesses the power of Blender and the Blenderproc library to generate synthetic datasets that reflect diverse anatomical, environmental, and behavioral conditions. The generated data, represented in graph form and standardized for optimal analysis, is utilized to train machine learning algorithms for identifying normal and abnormal gaits. Two distinct datasets with varying degrees of camera angle granularity are created to further investigate the influence of camera perspective on model accuracy. Preliminary results suggest that this simulation-based approach holds promise for advancing veterinary diagnostics by enabling more precise data acquisition and more effective machine learning models. By integrating synthetic and real-world patient data, the study lays a robust foundation for improving overall effectiveness and efficiency in veterinary medicine.
A Computational Topology-based Spatiotemporal Analysis Technique for Honeybee Aggregation.
EN: A primary challenge in understanding collective behavior is characterizing the spatiotemporal dynamics of the group. We employ topological data analysis to explore the structure of honeybee aggregations that form during trophallaxis, which is the direct exchange of food among nestmates. From the positions of individual bees, we build topological summaries called CROCKER matrices to track the morphology of the group as a function of scale and time. Each column of a CROCKER matrix records the number of topological features, such as the number of components or holes, that exist in the data for a range of analysis scales at a given point in time. To detect important changes in the morphology of the group from this information, we first apply dimensionality reduction techniques to these matrices and then use classic clustering and change-point detection algorithms on the resulting scalar data. A test of this methodology on synthetic data from an agent-based model of honeybees and their trophallaxis behavior shows two distinct phases: a dispersed phase that occurs before food is introduced, followed by a food-exchange phase during which aggregations form. We then move to laboratory data,...
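The CROCKER construction described above can be illustrated for the simplest topological feature, the number of connected components (Betti-0). The sketch below assumes that two points are linked when their Euclidean distance is at most the analysis scale, and it counts components with a union-find structure; holes (Betti-1) would need a persistent homology library and are not covered. The function names and the toy positions are hypothetical.

```python
import math


def count_components(points, scale):
    """Number of connected components when points within `scale` are linked."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= scale:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})


def crocker_betti0(frames, scales):
    """Rows index analysis scales, columns index time steps."""
    return [[count_components(frame, s) for frame in frames] for s in scales]


# Two made-up time steps: spread-out positions, then a tight aggregation.
frames = [
    [(0.0, 0.0), (3.0, 0.0), (6.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
]
scales = [0.5, 1.5, 3.5]

for row in crocker_betti0(frames, scales):
    print(row)
```

A drop in the component count at small scales from one column to the next is the kind of signal that would mark the transition from a dispersed phase to an aggregated one.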
Neurosymbolic AI for Reasoning on Biomedical Knowledge Graphs.
EN: Biomedical datasets are often modeled as knowledge graphs (KGs) because they capture the multi-relational, heterogeneous, and dynamic nature of biomedical systems. KG completion (KGC) can therefore help researchers make predictions to inform tasks like drug repositioning. While previous approaches for KGC were either rule-based or embedding-based, hybrid approaches based on neurosymbolic artificial intelligence are becoming more popular. Many of these methods possess unique characteristics that make them even better suited to biomedical challenges. Here, we survey such approaches with an emphasis on their utilities and prospective benefits for biomedicine.
Ab initio methods for polariton chemistry.
EN: Polariton chemistry exploits the strong interaction between quantized excitations in molecules and quantized photon states in optical cavities to affect chemical reactivity. Molecular polaritons have been experimentally realized by the coupling of electronic, vibrational, and rovibrational transitions to photon modes, which has spurred tremendous theoretical effort to model and explain how polariton formation can influence chemistry. This tutorial review focuses on computational approaches for the electronic strong coupling problem through the combination of familiar techniques from ab initio electronic structure theory and cavity quantum electrodynamics, toward the goal of supplying predictive theories for polariton chemistry. Our aim is to emphasize the relevant theoretical details with enough clarity for newcomers to the field to follow, and to present simple and practical code examples to catalyze further development work.
Molecular-Scale Visualization of Steric Effects of Ligand Binding to Reconstructed Au(111) Surfaces.
EN: Direct imaging of single molecules at nanostructured interfaces is a grand challenge, with potential to enable new, precise material architectures and technologies. Of particular interest are the structural morphology and spectroscopic signatures of the adsorbed molecule, where modern probes are only now being developed with the necessary spatial and energetic resolution to provide detailed information at the molecule-surface interface. Here, we directly visualize the binding of individual m-terphenyl isocyanide ligands to a reconstructed Au(111) surface through scanning tunneling microscopy (STM) and inelastic electron tunneling spectroscopy (IETS). The site-dependent steric pressure of the various surface features alters the vibrational fingerprints of the m-terphenyl isocyanides, which is characterized with single-molecule precision through joint experimental and theoretical approaches. For the first time, this study provides molecular-level insights into the steric-pressure-enabled surface binding selectivity, as well as its effect on the chemical properties of individual surface-binding ligands.
High-throughput Quantum Chemistry: Empowering the Search for Molecular Candidates behind Unknown Spectral Signatures in Exoplanetary Atmospheres.
EN: The identification of molecules in exoplanetary atmospheres is only possible thanks to the availability of high-resolution molecular spectroscopic data. However, because generating these data is intensive and time-consuming, only on the order of 100 molecules currently have high-resolution spectroscopic data available, limiting new molecular detections. Using routine quantum chemistry calculations (i.e., scaled harmonic frequency calculations using the B97-1/def2-TZVPD model chemistry with median errors of 10 cm⁻¹), here we present a complementary high-throughput approach to rapidly generate approximate vibrational spectral data for 2743 molecules made from the biologically most important elements C, H, N, O, P and S. Though these data are not accurate enough to enable definitive molecular detections and do not seek to replace the need for high-resolution data, they have powerful applications in identifying potential molecular candidates responsible for unknown spectral features. We explore this application for the 4.1 micron (2439 cm⁻¹) feature in the atmospheric spectrum of WASP-39b, listing potential alternative molecular species responsible for this spectral line, together with SO2. Fur...
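The unit conversion quoted in this abstract (4.1 micron is roughly 2439 cm⁻¹) and the idea of screening computed frequencies against an unassigned feature can be illustrated with a minimal sketch. The molecule names and frequencies in `computed` are hypothetical placeholders, not values from the paper, and using the quoted 10 cm⁻¹ median error as a search window is an assumption made here for illustration only.

```python
def micron_to_wavenumber(wavelength_um: float) -> float:
    """Convert a wavelength in microns to a wavenumber in cm^-1 (1 cm = 1e4 micron)."""
    return 1.0e4 / wavelength_um


def candidates_near(feature_cm1, computed, tolerance_cm1=10.0):
    """Return sorted names of molecules with any frequency within the tolerance."""
    hits = []
    for name, freqs in computed.items():
        if any(abs(f - feature_cm1) <= tolerance_cm1 for f in freqs):
            hits.append(name)
    return sorted(hits)


# Hypothetical placeholder data (NOT taken from the paper).
computed = {
    "molecule_A": [1150.0, 2435.0],
    "molecule_B": [900.0, 3100.0],
    "molecule_C": [2447.5],
}

feature = micron_to_wavenumber(4.1)  # about 2439 cm^-1
print(round(feature), candidates_near(feature, computed))
```

A wider or narrower window changes the candidate list, which is why such a screen can only suggest candidates rather than confirm a detection.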
Geometric Deep Learning for Structure-Based Drug Design: A Survey.
EN: Structure-based drug design (SBDD) leverages the three-dimensional geometry of proteins to identify potential drug candidates. Traditional approaches, rooted in physicochemical modeling and domain expertise, are often resource-intensive. Recent advancements in geometric deep learning, which effectively integrate and process 3D geometric data, alongside breakthroughs in accurate protein structure predictions from tools like AlphaFold, have significantly propelled the field forward. This paper systematically reviews the state-of-the-art in geometric deep learning for SBDD. We begin by outlining foundational tasks in SBDD, discussing prevalent 3D protein representations, and highlighting representative predictive and generative models. Next, we provide an in-depth review of key tasks, including binding site prediction, binding pose generation, de novo molecule generation, linker design, protein pocket generation, and binding affinity prediction. For each task, we present formal problem definitions, key methods, datasets, evaluation metrics, and performance benchmarks. Lastly, we explore current challenges and future opportunities in SBDD. Challenges include oversimplified problem form...
Feeding control and water quality monitoring in aquaculture systems: Opportunities and challenges.
EN: Aquaculture systems can benefit from the recent development of advanced control strategies to reduce operating costs and fish loss and to increase growth production efficiency, thereby improving fish welfare and health. Monitoring the water quality and controlling feeding are fundamental elements of balancing fish productivity and shaping the fish growth process. Currently, most fish-feeding processes are conducted manually in different phases and rely on time-consuming and challenging artificial discrimination. The feeding control approach influences fish growth and breeding through the feed conversion rate; hence, controlling these feeding parameters is crucial for enhancing fish welfare and minimizing general fishery costs. Environmental factors, such as high ammonia concentration and pH, affect the water quality and fish survival. Therefore, there is a critical need to develop control strategies to determine optimal, efficient, and reliable feeding processes and monitor water quality. This paper reviews the main control design techniques for fish growth in aquaculture systems, namely algorithms that optimize the feeding and water quality of a dynamic fish gr...
Detection and classification of faults aimed at preventive maintenance of PV systems.
EN: Diagnosis in PV systems aims to detect, locate and identify faults. Diagnosing these faults is vital to guarantee energy production and extend the useful life of PV power plants. In the literature, multiple machine learning approaches have been proposed for this purpose. However, few of these works have paid special attention to the detection of fine faults and the specialized process of extraction and selection of features for their classification. A fine fault is one whose characteristic signature is difficult to distinguish from that of a healthy panel. As a contribution to the detection of fine faults (especially of the snail trail type), this article proposes an innovative approach based on the Random Forest (RF) algorithm. This approach uses a complex feature extraction and selection method that improves the computational time of fault classification while maintaining high accuracy.
Multi-task Bioassay Pre-training for Protein-ligand Binding Affinity Prediction.
EN: Protein-ligand binding affinity (PLBA) prediction is a fundamental task in drug discovery. Recently, various deep learning-based models have predicted binding affinity by taking the three-dimensional structure of protein-ligand complexes as input, achieving remarkable progress. However, due to the scarcity of high-quality training data, the generalization ability of current models is still limited. In addition, different bioassays use varying affinity measurement labels (i.e., IC50, Ki, Kd), and different experimental conditions inevitably introduce systematic noise, which poses a significant challenge to constructing high-precision affinity prediction models. To address these issues, we (1) propose Multi-task Bioassay Pre-training (MBP), a pre-training framework for structure-based PLBA prediction; (2) construct a pre-training dataset called ChEMBL-Dock with more than 300k experimentally measured affinity labels and about 2.8M docked three-dimensional structures. By introducing multi-task pre-training to treat the prediction of different affinity labels as different tasks and classifying relative rankings between samples from the same bioassay, MBP learns robust and transferr...
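The idea of "classifying relative rankings between samples from the same bioassay" can be illustrated with a small sketch that builds pairwise ranking labels only within an assay, so that values measured under different conditions are never compared directly. This is an illustration of the general pairing idea, not the authors' implementation; the record layout, the function name `ranking_pairs`, and the example values are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def ranking_pairs(records):
    """Build (sample_a, sample_b, label) pairs within each assay.

    records: iterable of (assay_id, sample_id, affinity_value).
    label is 1 if sample_a has the higher value, else 0; ties are skipped.
    """
    by_assay = defaultdict(list)
    for assay_id, sample_id, value in records:
        by_assay[assay_id].append((sample_id, value))

    pairs = []
    for assay_id in sorted(by_assay):
        # Only samples sharing an assay are paired with each other.
        for (a, va), (b, vb) in combinations(by_assay[assay_id], 2):
            if va == vb:
                continue  # a tie carries no ranking information
            pairs.append((a, b, 1 if va > vb else 0))
    return pairs


# Hypothetical example records (values are placeholders).
records = [
    ("assay_1", "lig_a", 6.2),
    ("assay_1", "lig_b", 7.9),
    ("assay_1", "lig_c", 6.2),
    ("assay_2", "lig_d", 5.0),
    ("assay_2", "lig_e", 4.1),
]
print(ranking_pairs(records))
```

Pairs such as these could feed a binary classification objective, while absolute values would be handled by separate regression tasks per label type; how the paper combines the two is not shown here.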
Analysis, Identification and Prediction of Parkinson Disease Sub-Types and Progression through Machine Learning.
EN: This paper represents a groundbreaking advancement in Parkinson disease (PD) research by employing a novel machine learning framework to categorize PD into distinct subtypes and predict its progression. Utilizing a comprehensive dataset encompassing both clinical and neurological parameters, the research applies advanced supervised and unsupervised learning techniques. This innovative approach enables the identification of subtle, yet critical, patterns in PD manifestation, which traditional methodologies often miss. Significantly, this research offers a path toward personalized treatment strategies, marking a major stride in the precision medicine domain and showcasing the transformative potential of integrating machine learning into medical research.
Balancing the Benefits of Vaccination: an Envy-Free Strategy.
EN: The Covid-19 pandemic revealed the difficulties of vaccinating a population under circumstances marked by urgency and limited availability of doses while balancing benefits associated with distinct guidelines satisfying specific ethical criteria (J.W. Wu, S.D. John, E.Y. Adashi, Allocating Vaccines in the Pandemic: The Ethical Dimension, The Am. J. of Medicine V.33(11): 1241 - 1242 (2020)). We offer a vaccination strategy that may be useful in this regard. It relies on the mathematical concept of envy-freeness. We consider finding balance by allocating the resource among individuals that seem to be heterogeneous concerning the direct and indirect benefits of vaccination, depending on age. The proposed strategy adapts a constructive approach in the literature based on Sperner's Lemma to point out an approximate division of doses guaranteeing that both benefits are optimized each time a batch becomes available. Applications using data about population age distributions from diverse countries suggest that, among other features, this strategy maintains the desired balance throughout the entire vaccination period.
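For readers unfamiliar with envy-freeness, the generic definition can be checked in a few lines: an allocation is envy-free when no party values another party's share above its own. The sketch below uses the textbook definition with an optional slack `eps` for approximate envy-freeness; it does not reproduce the paper's Sperner's Lemma construction, and the valuation numbers are hypothetical.

```python
def is_envy_free(valuations, eps=0.0):
    """Check (approximate) envy-freeness.

    valuations[i][j] is the value party i assigns to share j,
    and party i is assumed to hold share i.
    """
    n = len(valuations)
    return all(
        valuations[i][i] + eps >= valuations[i][j]
        for i in range(n)
        for j in range(n)
    )


# Hypothetical valuations: each row is one party's view of both shares.
balanced = [[0.6, 0.4], [0.3, 0.7]]
unbalanced = [[0.4, 0.6], [0.5, 0.5]]

print(is_envy_free(balanced))              # each party prefers its own share
print(is_envy_free(unbalanced))            # party 0 prefers share 1
print(is_envy_free(unbalanced, eps=0.25))  # tolerated under a slack of 0.25
```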
Evaluation of ChatGPT on Biomedical Tasks: A Zero-Shot Comparison with Fine-Tuned Generative Transformers.
EN: ChatGPT is a large language model developed by OpenAI. Despite its impressive performance across various tasks, no prior work has investigated its capability in the biomedical domain yet. To this end, this paper aims to evaluate the performance of ChatGPT on various benchmark biomedical tasks, such as relation extraction, document classification, question answering, and summarization. To the best of our knowledge, this is the first work that conducts an extensive evaluation of ChatGPT in the biomedical domain. Interestingly, we find based on our evaluation that in biomedical datasets that have smaller training sets, zero-shot ChatGPT even outperforms the state-of-the-art fine-tuned generative transformer models, such as BioGPT and BioBART. This suggests that ChatGPT's pre-training on large text corpora makes it quite specialized even in the biomedical domain. Our findings demonstrate that ChatGPT has the potential to be a valuable tool for various tasks in the biomedical domain that lack large annotated data.
MolFM: A Multimodal Molecular Foundation Model.
EN: Molecular knowledge resides within three different modalities of information sources: molecular structures, biomedical documents, and knowledge bases. Effective incorporation of molecular knowledge from these modalities holds paramount significance in facilitating biomedical research. However, existing multimodal molecular foundation models exhibit limitations in capturing intricate connections between molecular structures and texts, and more importantly, none of them attempt to leverage a wealth of molecular expertise derived from knowledge graphs. In this study, we introduce MolFM, a multimodal molecular foundation model designed to facilitate joint representation learning from molecular structures, biomedical texts, and knowledge graphs. We propose cross-modal attention between atoms of molecular structures, neighbors of molecule entities and semantically related texts to facilitate cross-modal comprehension. We provide theoretical analysis that our cross-modal pre-training captures local and global molecular knowledge by minimizing the distance in the feature space between different modalities of the same molecule, as well as molecules sharing similar structures or functions. M...
The central dogma of biological homochirality: How does chiral information propagate in a prebiotic network?
EN: Biological systems are homochiral, raising the question of how a racemic mixture of prebiotically synthesized biomolecules could attain a homochiral state at the network level. Based on our recent results, we aim to address a related question of how chiral information might have flowed in a prebiotic network. Utilizing the crystallization properties of the central RNA precursor known as ribose-aminooxazoline (RAO), we showed that its homochiral crystals can be obtained from its fully racemic solution on a magnetic mineral surface, due to the chiral-induced spin selectivity (CISS) effect. Moreover, we uncovered a mechanism facilitated by the CISS effect through which chiral molecules, like RAO, can uniformly magnetize such surfaces in a variety of planetary environments in a persistent manner. All this is very tantalizing, because recent experiments with tRNA analogs demonstrate high stereoselectivity in the attachment of L-amino acids to D-ribonucleotides, enabling the transfer of homochirality from RNA to peptides. Therefore the biological homochirality problem may be reduced to ensuring that a single common RNA precursor (e.g. RAO) can be made homochiral. The emergence of homochi...
ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback.
EN: Recent advancements in conversational large language models (LLMs), such as ChatGPT, have demonstrated remarkable promise in various domains, including drug discovery. However, existing works mainly focus on investigating the capabilities of conversational LLMs on chemical reaction and retrosynthesis, while drug editing, a critical task in the drug discovery pipeline, remains largely unexplored. To bridge this gap, we propose ChatDrug, a framework to facilitate the systematic investigation of drug editing using LLMs. ChatDrug jointly leverages a prompt module, a retrieval and domain feedback (ReDF) module, and a conversation module to streamline effective drug editing. We empirically show that ChatDrug reaches the best performance on 33 out of 39 drug editing tasks, encompassing small molecules, peptides, and proteins. We further demonstrate, through 10 case studies, that ChatDrug can successfully identify the key substructures (e.g., the molecule functional groups, peptide motifs, and protein structures) for manipulation, generating diverse and valid suggestions for drug editing. Promisingly, we also show that ChatDrug can offer insightful explanations from a domain-specific persp...
Drug Repurposing Targeting COVID-19 3CL Protease using Molecular Docking and Machine Learning Regression Approach.
EN: The COVID-19 pandemic has initiated a global health emergency, with an exigent need for an effective cure. Drug repurposing is increasingly emerging as a promising solution, as it saves time, cost and labor. However, the number of drug candidates that have been identified for repurposing in the treatment of COVID-19 is still insufficient, so more effective and thorough drug exploration strategies are required. In this study, we combine molecular docking with machine learning regression approaches to find prospective therapeutic candidates for COVID-19 treatment. We screened 5903 approved drugs for inhibition of the main protease 3CL of SARS-CoV-2, which is responsible for replicating the virus. Molecular docking is used to calculate the binding affinities of these drugs to the main protease 3CL. We employed several machine learning regression approaches for QSAR modeling to find potential drugs with high binding affinities. Our outcomes demonstrated that the Decision Tree Regression (DTR) model, with the best scores of R2 and RMSE, is the most suitable model to explore the potential drugs. We shortlisted six favorable drugs. These drugs have novel repur...
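The two model-selection metrics named in this abstract, R2 and RMSE, follow standard definitions that can be written out directly. The sketch below is a plain-Python illustration of those definitions only; the "docking scores" are hypothetical numbers, not results from the paper, and no regression model is fitted.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical binding affinities (e.g. kcal/mol) and model predictions.
y_true = [-7.0, -8.0, -9.0, -6.0]
y_pred = [-7.5, -8.0, -8.5, -6.0]

print(f"RMSE = {rmse(y_true, y_pred):.3f}, R2 = {r2_score(y_true, y_pred):.3f}")
```

A lower RMSE and an R2 closer to 1 both indicate a better fit, which is the sense in which one regressor can be ranked above another.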
Drugst.One -- A plug-and-play solution for online systems medicine and network-based drug repurposing.
EN: In recent decades, the development of new drugs has become increasingly expensive and inefficient, and the molecular mechanisms of most pharmaceuticals remain poorly understood. In response, computational systems and network medicine tools have emerged to identify potential drug repurposing candidates. However, these tools often require complex installation and lack intuitive visual network mining capabilities. To tackle these challenges, we introduce Drugst.One, a platform that assists specialized computational medicine tools in becoming user-friendly, web-based utilities for drug repurposing. With just three lines of code, Drugst.One turns any systems biology software into an interactive web tool for modeling and analyzing complex protein-drug-disease networks. Demonstrating its broad adaptability, Drugst.One has been successfully integrated with 21 computational systems medicine tools. Available at https://drugst.one, Drugst.One has significant potential for streamlining the drug discovery process, allowing researchers to focus on essential aspects of pharmaceutical treatment research.
Evaluation of the MACE Force Field Architecture: from Medicinal Chemistry to Materials Science.
EN: The MACE architecture represents the state of the art in the field of machine learning force fields for a variety of in-domain, extrapolation and low-data regime tasks. In this paper, we further evaluate MACE by fitting models for published benchmark datasets. We show that MACE generally outperforms alternatives for a wide range of systems from amorphous carbon, universal materials modelling, and general small molecule organic chemistry to large molecules and liquid water. We demonstrate the capabilities of the model on tasks ranging from constrained geometry optimisation to molecular dynamics simulations and find excellent performance across all tested domains. We show that MACE is very data efficient, and can reproduce experimental molecular vibrational spectra when trained on as few as 50 randomly selected reference configurations. We further demonstrate that the strictly local atom-centered model is sufficient for such tasks even in the case of large molecules and weakly interacting molecular assemblies.
Partial Annotation Learning for Biomedical Entity Recognition.
EN: Motivation: Named Entity Recognition (NER) is a key task to support biomedical research. In Biomedical Named Entity Recognition (BioNER), obtaining high-quality expert annotated data is laborious and expensive, leading to the development of automatic approaches such as distant supervision. However, manually and automatically generated data often suffer from the unlabeled entity problem, whereby many entity annotations are missing, degrading the performance of full annotation NER models. Results: To address this problem, we systematically study the effectiveness of partial annotation learning methods for biomedical entity recognition over different simulated scenarios of missing entity annotations. Furthermore, we propose a TS-PubMedBERT-Partial-CRF partial annotation learning model. We harmonize 15 biomedical NER corpora encompassing five entity types to serve as a gold standard and compare against two commonly used partial annotation learning models, BiLSTM-Partial-CRF and EER-PubMedBERT, and the state-of-the-art full annotation learning BioNER model PubMedBERT tagger. Results show that partial annotation learning-based methods can effectively learn from biomedical corpora with mi...
Learning Subpocket Prototypes for Generalizable Structure-based Drug Design.
EN: Generating molecules with high binding affinities to target proteins (a.k.a. structure-based drug design) is a fundamental and challenging task in drug discovery. Recently, deep generative models have achieved remarkable success in generating 3D molecules conditioned on the protein pocket. However, most existing methods consider molecular generation for protein pockets independently while neglecting the underlying connections such as subpocket-level similarities. Subpockets are the local protein environments of ligand fragments and pockets with similar subpockets may bind the same molecular fragment (motif) even though their overall structures are different. Therefore, the trained models can hardly generalize to unseen protein pockets in real-world applications. In this paper, we propose a novel method DrugGPS for generalizable structure-based drug design. With the biochemical priors, we propose to learn subpocket prototypes and construct a global interaction graph to model the interactions between subpocket prototypes and molecular motifs. Moreover, a hierarchical graph transformer encoder and motif-based 3D molecule generation scheme are used to improve the model's performance. T...
DermSynth3D: Synthesis of in-the-wild Annotated Dermatology Images.
EN: In recent years, deep learning (DL) has shown great potential in the field of dermatological image analysis. However, existing datasets in this domain have significant limitations, including a small number of image samples, limited disease conditions, insufficient annotations, and non-standardized image acquisitions. To address these shortcomings, we propose a novel framework called DermSynth3D. DermSynth3D blends skin disease patterns onto 3D textured meshes of human subjects using a differentiable renderer and generates 2D images from various camera viewpoints under chosen lighting conditions in diverse background scenes. Our method adheres to top-down rules that constrain the blending and rendering process to create 2D images with skin conditions that mimic in-the-wild acquisitions, ensuring more meaningful results. The framework generates photo-realistic 2D dermoscopy images and the corresponding dense annotations for semantic segmentation of the skin, skin conditions, body parts, bounding boxes around lesions, depth maps, and other 3D scene parameters, such as camera position and lighting conditions. DermSynth3D allows for the creation of custom datasets for various dermatolog...
Vaxformer: Antigenicity-controlled Transformer for Vaccine Design Against SARS-CoV-2.
EN: The SARS-CoV-2 pandemic has emphasised the importance of developing a universal vaccine that can protect against current and future variants of the virus. The present study proposes a novel conditional protein Language Model architecture, called Vaxformer, which is designed to produce natural-looking antigenicity-controlled SARS-CoV-2 spike proteins. We evaluate the generated protein sequences of the Vaxformer model using DDGun protein stability measure, netMHCpan antigenicity score, and a structure fidelity score with AlphaFold to gauge its viability for vaccine development. Our results show that Vaxformer outperforms the existing state-of-the-art Conditional Variational Autoencoder model in generating antigenicity-controlled SARS-CoV-2 spike proteins. These findings suggest promising opportunities for conditional Transformer models to expand our understanding of vaccine design and their role in mitigating global health challenges. The code used in this study is available at https://github.com/aryopg/vaxformer .
Analysing Biomedical Knowledge Graphs using Prime Adjacency Matrices.
EN: Most phenomena related to biomedical tasks are inherently complex, and in many cases, are expressed as signals on biomedical Knowledge Graphs (KGs). In this work, we introduce the use of a new representation framework, the Prime Adjacency Matrix (PAM) for biomedical KGs, which allows for very efficient network analysis. PAM utilizes prime numbers to enable representing the whole KG with a single adjacency matrix and the fast computation of multiple properties of the network. We illustrate the applicability of the framework in the biomedical domain by working on different biomedical knowledge graphs and by providing two case studies: one on drug-repurposing for COVID-19 and one on important metapath extraction. We show that we achieve better results than the original proposed workflows, using very simple methods that require no training, in considerably less time.
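The core idea of a prime-based adjacency matrix can be sketched in a few lines: each relation type is mapped to a distinct prime, so a single integer matrix encodes a multi-relational graph. In the sketch below, combining parallel edges by multiplying their primes (so that the relation set is recoverable by divisibility) is an assumption made for illustration, as are the function names and the toy triples; the paper's exact construction may differ.

```python
def first_primes(count):
    """Return the first `count` primes by simple trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def build_pam(num_nodes, triples):
    """Build a prime adjacency matrix from (head, relation, tail) triples."""
    relations = sorted({r for _, r, _ in triples})
    rel_to_prime = dict(zip(relations, first_primes(len(relations))))
    pam = [[0] * num_nodes for _ in range(num_nodes)]
    for h, r, t in triples:
        p = rel_to_prime[r]
        # Assumption: parallel edges are combined by multiplying their primes.
        pam[h][t] = p if pam[h][t] == 0 else pam[h][t] * p
    return pam, rel_to_prime


def relations_between(pam, rel_to_prime, h, t):
    """Recover the relation types on edge h -> t via divisibility."""
    value = pam[h][t]
    return sorted(r for r, p in rel_to_prime.items() if value and value % p == 0)


# Toy graph with hypothetical node indices and relation names.
triples = [(0, "treats", 1), (0, "binds", 1), (1, "associated_with", 2)]
pam, rel_to_prime = build_pam(3, triples)
print(pam)
print(relations_between(pam, rel_to_prime, 0, 1))
```

Because every edge value factors uniquely into relation primes, simple integer operations on one matrix can stand in for bookkeeping over several per-relation matrices.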
Generation of 3D Molecules in Pockets via Language Model.
EN: Generative models for molecules based on sequential line notation (e.g. SMILES) or graph representation have attracted an increasing interest in the field of structure-based drug design, but they struggle to capture important 3D spatial interactions and often produce undesirable molecular structures. To address these challenges, we introduce Lingo3DMol, a pocket-based 3D molecule generation method that combines language models and geometric deep learning technology. A new molecular representation, fragment-based SMILES with local and global coordinates, was developed to assist the model in learning molecular topologies and atomic spatial positions. Additionally, we trained a separate noncovalent interaction predictor to provide essential binding pattern information for the generative model. Lingo3DMol can efficiently traverse drug-like chemical spaces, preventing the formation of unusual structures. The Directory of Useful Decoys-Enhanced (DUD-E) dataset was used for evaluation. Lingo3DMol outperformed state-of-the-art methods in terms of drug-likeness, synthetic accessibility, pocket binding mode, and molecule generation speed.
PulseNet: Deep Learning ECG-signal classification using random augmentation policy and continuous wavelet transform for canines.
EN: Evaluating canine electrocardiograms (ECG) requires skilled veterinarians, but the current availability of veterinary cardiologists for ECG interpretation and diagnostic support is limited. Developing tools for automated assessment of ECG sequences can improve veterinary care by providing clinicians with real-time results and decision support tools. We implement a deep convolutional neural network (CNN) approach for classifying canine electrocardiogram sequences as either normal or abnormal. ECG records are converted into 8-second Lead II sequences and classified as either normal (no evidence of cardiac abnormalities) or abnormal (presence of one or more cardiac abnormalities). For training, ECG sequences are randomly augmented using RandomAugmentECG, a new augmentation library implemented specifically for this project. Each chunk is then converted using a continuous wavelet transform into a 2D scalogram. The 2D scalograms are then classified as either normal or abnormal by a binary CNN classifier. Experimental results are validated against three boarded veterinary cardiologists, achieving an AUC-ROC score of 0.9506 on the test dataset and matching human-level performance. Additionally, we describe ...
Reactions of Acetonitrile with Trapped, Translationally Cold Acetylene Cations.
EN: The reaction of the acetylene cation (C2H2+) with acetonitrile (CH3CN) is measured in a linear Paul ion trap coupled to a time-of-flight mass spectrometer. C2H2+ and CH3CN are both noted for their astrochemical abundance and predicted relevance for understanding prebiotic chemistry. The observed primary products are c-C3H3+, C3H4+ and C2NH3+. The latter two products react with excess CH3CN to form the secondary product C2NH4+, protonated acetonitrile. The molecular formula of these ionic products can be verified with the aid of isotope substitution via deuteration of the reactants. Primary product reaction pathways and thermodynamics are investigated with quantum chemical calculations and demonstrate exothermic pathways to two isomers of C2NH3+, two isomers of C3H4+, and the cyclopropenyl cation c-C3H3+. This study deepens our understanding of the dynamics and products of a pertinent ion-molecule reaction between two astrochemically abundant molecules in conditions that mimic those of the interstellar medium.
Cooperating Graph Neural Networks with Deep Reinforcement Learning for Vaccine Prioritization.
EN: This study explores the vaccine prioritization strategy to reduce the overall burden of the pandemic when the supply is limited. Existing methods conduct macro-level or simplified micro-level vaccine distribution by assuming homogeneous behavior within subgroup populations and do not integrate mobility dynamics. Directly applying these models to micro-level vaccine allocation leads to sub-optimal solutions due to the lack of behavior-related details. To address the issue, we first incorporate mobility heterogeneity in disease dynamics modeling and mimic the disease evolution process using a Trans-vaccine-SEIR model. Then we develop a novel deep reinforcement learning method to seek the optimal vaccine allocation strategy for the high-degree spatial-temporal disease evolution system. A graph neural network is used to effectively capture the structural properties of the mobility contact network and extract the dynamic disease features. In our evaluation, the proposed framework reduces infections and deaths by 7% - 10% compared with the baseline strategies. Extensive evaluation shows that the proposed framework is robust to seek the optimal vaccine allocation with diverse mobility pat...
In silico Identification of tipifarnib-like compounds by structure-based pharmacophore, virtual screening and molecular docking against K-Ras post-translation in colorectal cancer.
EN: Colorectal cancer is a public health problem. Approximately 30 to 50% of colorectal tumors are caused by mutations in the KRAS gene. These mutations induce uncontrolled proliferation. To date, there is no approved effective treatment for the mutated KRAS oncogene. Farnesyltransferase inhibitors (FTIs) are considered a therapeutic approach against the mutated KRAS oncogene. Tipifarnib is a farnesyltransferase inhibitor that was analyzed in a Phase II trial. In the present study, the three-dimensional structure of farnesyltransferase complexed with tipifarnib [1SA4] was used as a basis to exploit the characteristics of tipifarnib. A pharmacophore model was generated based on the structure using the Asinex (Gold and Platinum Collections) database. A total of 299 molecules were obtained after screening. The 299 molecules were anchored to the tipifarnib binding site in the farnesyltransferase crystal structure for docking analysis. During the molecular docking process, the modeled pharmacophore was used as a constraint to eliminate the molecules that do not satisfy it. Finally, four hits were identified as farnesyltransferase inhibitors for biological tests. Keywords: color...
Network pharmacology on the mechanism of Yi Qi Tong Qiao Pill inhibiting allergic rhinitis.
EN: Objective: The purpose of this study is to reveal the mechanism of action of Yi Qi Tong Qiao Pill (YQTQP) in the treatment of allergic rhinitis (AR), as well as to establish a paradigm for research on traditional Chinese medicine (TCM) from a systematic perspective. Methods: Based on the data collected from TCM-related and disease-related databases, target profiles of compounds in YQTQP were calculated through network-based algorithms and the holistic targets of YQTQP were constructed. Network target analysis was performed to explore the potential mechanisms of YQTQP in the treatment of AR, and the mechanisms were classified into different modules according to their biological functions. In addition, animal and clinical experiments were conducted to validate the findings inferred from network target analysis. Results: Network target analysis showed that YQTQP targeted 12 main pathways or biological processes related to AR, represented by those related to IL-4, IFN-γ, TNF-α and IL-13. These results could be classified into 3 biological modules, including regulation of immune and inflammation, epithelial barrier disorder and cell adhesion. Finally, a series of experiments composed of animal a...
Bridging Heterogeneity Dictates the Microstructure and Yielding Response of Polymer-Linked Emulsions.
EN: Soft materials possessing tunable rheological properties are desirable in applications ranging from 3D printing to biological scaffolds. Here, we use a telechelic, triblock copolymer polystyrene-b-poly(ethylene oxide)-b-polystyrene (SEOS) to form elastic networks of polymer-linked droplets in cyclohexane-in-water emulsions. The SEOS endblocks partition into the dispersed cyclohexane droplets while the midblocks remain in the aqueous continuous phase, resulting in each chain taking on either a looping or bridging conformation. We examine the yield transition of these polymer-linked emulsions through large amplitude oscillatory shear (LAOS) and probe the emulsion structure through confocal microscopy, concluding that polymers that more readily form bridges generate a strongly percolated network, whereas those that are less prone to form bridges tend to produce networks composed of weakly-linked clusters of droplets. When yielded, the emulsions consisting of linked clusters break apart into individual clusters that can rearrange upon the application of further shear. By contrast, when the systems containing a more homogeneous bridging density are yielded, the system remains percolated...
Forecast reconciliation for vaccine supply chain optimization.
EN: Vaccine supply chain optimization can benefit from hierarchical time series forecasting, when grouping the vaccines by type or location. However, forecasts at different hierarchy levels become incoherent when higher levels do not match the sum of the lower-level forecasts, which can be addressed by reconciliation methods. In this paper, we tackle the vaccine sale forecasting problem by modeling sales data from GSK between 2010 and 2021 as a hierarchical time series. After forecasting future values with several ARIMA models, we systematically compare the performance of various reconciliation methods, using statistical tests. We also compare the performance of the forecast before and after COVID. The results highlight Minimum Trace and Weighted Least Squares with Structural scaling as the best performing methods, which provided a coherent forecast while reducing the forecast error of the baseline ARIMA.
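To make the reconciliation step concrete, here is a minimal sketch of weighted least squares reconciliation with structural scaling in its standard textbook form: reconciled = S (S' W^-1 S)^-1 S' W^-1 y_hat, where S is the summing matrix and W is diagonal with the number of bottom-level series feeding each node. The two-series hierarchy and the forecast values are hypothetical, and nothing here reproduces the authors' data or pipeline.

```python
import numpy as np


def reconcile_wls_structural(S, base_forecasts):
    """Reconcile incoherent base forecasts with WLS and structural scaling.

    S maps bottom-level series to every series in the hierarchy.
    W = diag(S @ 1), so each node is weighted by how many bottom series it sums.
    """
    S = np.asarray(S, dtype=float)
    y_hat = np.asarray(base_forecasts, dtype=float)
    w_inv = 1.0 / S.sum(axis=1)        # diagonal of W^-1
    St_Winv = S.T * w_inv              # S' W^-1 (broadcast scales the columns)
    bottom = np.linalg.solve(St_Winv @ S, St_Winv @ y_hat)
    return S @ bottom                  # coherent forecasts for all levels


# Hypothetical hierarchy: total = vaccine_A + vaccine_B.
S = [[1, 1],   # total
     [1, 0],   # vaccine_A
     [0, 1]]   # vaccine_B
base = [100.0, 45.0, 50.0]             # incoherent: 45 + 50 != 100
reconciled = reconcile_wls_structural(S, base)
print(reconciled)
```

After reconciliation the top-level value equals the sum of the two bottom-level values, which is the coherence property the abstract refers to.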
A noise-robust acoustic method for recognizing foraging activities of grazing cattle.
EN: Farmers must continuously improve their livestock production systems to remain competitive in the growing dairy market. Precision livestock farming technologies provide individualized monitoring of animals on commercial farms, optimizing livestock production. Continuous acoustic monitoring is a widely accepted sensing technique used to estimate the daily rumination and grazing time budget of free-ranging cattle. However, typical environmental and natural noises on pastures noticeably degrade performance, limiting the practical application of current acoustic methods. In this study, we present the operating principle and generalization capability of an acoustic method called Noise-Robust Foraging Activity Recognizer (NRFAR). The proposed method determines foraging activity bouts by analyzing fixed-length segments of identified jaw movement events produced during grazing and rumination. The additive noise robustness of the NRFAR was evaluated for several signal-to-noise ratios using stationary Gaussian white noise and four different nonstationary natural noise sources. In noiseless conditions, NRFAR reached an average balanced accuracy of 86.4%, outperforming two previous acoustic ...
BactInt: A domain driven transfer learning approach and a corpus for extracting inter-bacterial interactions from biomedical text.
EN: The community of different types of microbes present in a biological niche plays a very important role in functioning of the system. The crosstalk or interactions among the different microbes contributes to the building blocks of such microbial community structures. Evidence reported in biomedical text serves as a reliable source for predicting such interactions. However, going through the vast and ever-increasing volume of biomedical literature is an intimidating and time consuming process. This necessitates development of automated methods capable of accurately extracting bacterial relations reported in biomedical literature. In this paper, we introduce a method for automated extraction of microbial interactions (specifically between bacteria) from biomedical literature along with ways of using transfer learning to improve its accuracy. We also describe a pipeline using which relations among specific bacteria groups can be mined. Additionally, we introduce the first publicly available dataset which can be used to develop bacterial interaction extraction methods.
Uni-QSAR: an Auto-ML Tool for Molecular Property Prediction.
EN: Recently, deep learning (DL) based quantitative structure-activity relationship (QSAR) models have outperformed traditional methods on property prediction tasks in drug discovery. However, the performance of most DL based QSAR models is restricted by limited labeled data, and they are also sensitive to model scale and hyper-parameters. In this paper, we propose Uni-QSAR, a powerful Auto-ML tool for molecule property prediction tasks. Uni-QSAR combines molecular representation learning (MRL) of 1D sequential tokens, 2D topology graphs, and 3D conformers with pretraining models to leverage rich representation from large-scale unlabeled data. Without any manual fine-tuning or model selection, Uni-QSAR outperforms SOTA in 21/22 tasks of the Therapeutic Data Commons (TDC) benchmark under the designed parallel workflow, with an average performance improvement of 6.09%. Furthermore, we demonstrate the practical usefulness of Uni-QSAR in drug discovery domains.
Identification of interstellar cyanamide towards the hot molecular core G358.93-0.03 MM1.
EN: The amide-related molecules are essential for the formation of the other complex bio-molecules and an understanding of the prebiotic chemistry in the interstellar medium (ISM). We presented the first detection of the rotational emission lines of the amide-like molecule cyanamide (NH$_{2}$CN) towards the hot molecular core G358.93$-$0.03 MM1 using the Atacama Large Millimeter/Submillimeter Array (ALMA). Using the rotational diagram model, the derived column density of NH$_{2}$CN towards G358.93$-$0.03 MM1 was (5.9$\pm$2.5)$\times$10$^{14}$ cm$^{-2}$ with a rotational temperature of 100.6$\pm$30.4 K. The derived fractional abundance of NH$_{2}$CN towards G358.93$-$0.03 MM1 with respect to H$_{2}$ was (4.72$\pm$2.0)$\times$10$^{-10}$, which is very similar to the existing three-phase warm-up chemical model abundances of NH$_{2}$CN. We compare the estimated abundance of NH$_{2}$CN towards G358.93$-$0.03 MM1 with other sources, and we observe the abundance of NH$_{2}$CN towards G358.93$-$0.03 MM1 is nearly similar to that of the Sculptor galaxy NGC 253 and the low-mass protostars IRAS 16293-2422 B and NGC 1333 IRAS4A2. We also discussed the possible formation mechanisms of NH${...
SkinGPT-4: An Interactive Dermatology Diagnostic System with Visual Large Language Model.
EN: Skin and subcutaneous diseases rank high among the leading contributors to the global burden of nonfatal diseases, impacting a considerable portion of the population. Nonetheless, the field of dermatology diagnosis faces three significant hurdles. Firstly, there is a shortage of dermatologists accessible to diagnose patients, particularly in rural regions. Secondly, accurately interpreting skin disease images poses a considerable challenge. Lastly, generating patient-friendly diagnostic reports is usually a time-consuming and labor-intensive task for dermatologists. To tackle these challenges, we present SkinGPT-4, which is the world's first interactive dermatology diagnostic system powered by an advanced visual large language model. SkinGPT-4 leverages a fine-tuned version of MiniGPT-4, trained on an extensive collection of skin disease images (comprising 52,929 publicly available and proprietary images) along with clinical concepts and doctors' notes. We designed a two-step training process to allow SkinGPT to express medical features in skin disease images with natural language and make accurate diagnoses of the types of skin diseases. With SkinGPT-4, users could upload their ow...
Cultural-aware Machine Learning based Analysis of COVID-19 Vaccine Hesitancy.
EN: Understanding COVID-19 vaccine hesitancy, including who is hesitant and why, is crucial, since large-scale vaccine adoption remains one of the most efficient methods of controlling the pandemic. Such an understanding also provides insights into designing successful vaccination campaigns for future pandemics. Unfortunately, many factors are involved in deciding whether to take the vaccine, especially from the cultural point of view. To achieve these goals, we design a novel culture-aware machine learning (ML) model, based on our new data collection, for predicting vaccination willingness. We further analyze the most important features which contribute to the ML model's predictions using advanced AI explainers such as the Probabilistic Graphical Model (PGM) and Shapley Additive Explanations (SHAP). These analyses reveal the key factors that most likely impact the vaccine adoption decisions. Our findings show that Hispanic and African American communities are most likely impacted by cultural characteristics such as religions and ethnic affiliation, whereas vaccine trust and approval influence the Asian communities the most. Our results also show that cultural characteristics, rumors, and...
Vax-Culture: A Dataset for Studying Vaccine Discourse on Twitter.
EN: Vaccine hesitancy continues to be a main challenge for public health officials during the COVID-19 pandemic. As this hesitancy undermines vaccine campaigns, many researchers have sought to identify its root causes, finding that the increasing volume of anti-vaccine misinformation on social media platforms is a key element of this problem. We explored Twitter as a source of misleading content with the goal of extracting overlapping cultural and political beliefs that motivate the spread of vaccine misinformation. To do this, we have collected a data set of vaccine-related Tweets and annotated them with the help of a team of annotators with a background in communications and journalism. Ultimately we hope this can lead to effective and targeted public health communication strategies for reaching individuals with anti-vaccine beliefs. Moreover, this information helps with developing Machine Learning models to automatically detect vaccine misinformation posts and combat their negative impacts. In this paper, we present Vax-Culture, a novel Twitter COVID-19 dataset consisting of 6373 vaccine-related tweets accompanied by an extensive set of human-provided annotations including vaccine-h...
MIK2 is a candidate gene of the S-locus for sporophytic self-incompatibility (SSI) in chicory (Cichorium intybus, Asteraceae).
EN: The Cichorium genus offers a unique opportunity to study the sporophytic self-incompatibility (SSI) system, being composed of species characterized by highly efficient SI (C. intybus) and complete self-compatibility (C. endivia). The chicory genome was used to map 7 previously identified SSI locus-associated markers. The region containing the S locus was restricted to a 4 Mbp window on chromosome 5. Among the genes predicted in this region, MDIS1 INTERACTING RECEPTOR LIKE KINASE 2 (MIK2) was promising as a candidate for SSI. Its ortholog in Arabidopsis is involved in pollen-stigma recognition reactions, and its protein structure is similar to that of S-receptor kinase (SRK), a key component of the SSI in the Brassica genus. The sequencing of MIK2 in chicory and endive accessions revealed two contrasting scenarios. In C. endivia, MIK2 was fully conserved even comparing different botanical varieties (smooth and curly). In C. intybus, 387 SNPs and 3 INDELs were identified when comparing accessions of different biotypes from the same botanical variety (radicchio). The SNP distribution throughout the gene was uneven, with hypervariable domains preferentially localized in the LRR-rich ...
Inverse design of artificial skins.
EN: Mimicking the perceptual functions of human cutaneous mechanoreceptors, artificial skins or flexible pressure sensors can transduce tactile stimuli to quantitative electrical signals. Conventional methods to design such devices follow a forward structure-to-property routine based on trial-and-error experiments/simulations, which take months or longer to determine one solution valid for one specific material. Target-oriented inverse design that shows far higher output efficiency has proven effective in other fields, but is still absent for artificial skins because of the difficulties in acquiring big data. Here, we report a property-to-structure inverse design of artificial skins based on small dataset machine learning, exhibiting a comprehensive efficiency at least four orders of magnitude higher than the conventional routine. The inverse routine can predict hundreds of solutions that overcome the intrinsic signal saturation problem for linear response in hours, and the solutions are valid to a variety of materials. Our results demonstrate that the inverse design allowed by small dataset is an efficient and powerful tool to target multifarious applications of artificial skins, whic...
DiffDock-PP: Rigid Protein-Protein Docking with Diffusion Models.
EN: Understanding how proteins structurally interact is crucial to modern biology, with applications in drug discovery and protein design. Recent machine learning methods have formulated protein-small molecule docking as a generative problem with significant performance boosts over both traditional and deep learning baselines. In this work, we propose a similar approach for rigid protein-protein docking: DiffDock-PP is a diffusion generative model that learns to translate and rotate unbound protein structures into their bound conformations. We achieve state-of-the-art performance on DIPS with a median C-RMSD of 4.85, outperforming all considered baselines. Additionally, DiffDock-PP is faster than all search-based methods and generates reliable confidence estimates for its predictions. Our code is publicly available at https://github.com/ketatam/DiffDock-PP
Development and Evaluation of Conformal Prediction Methods for QSAR.
EN: The quantitative structure-activity relationship (QSAR) regression model is a commonly used technique for predicting biological activities of compounds using their molecular descriptors. Predictions from QSAR models can help, for example, to optimize molecular structure; prioritize compounds for further experimental testing; and estimate their toxicity. In addition to the accurate estimation of the activity, it is highly desirable to obtain some estimate of the uncertainty associated with the prediction, e.g., calculate a prediction interval (PI) containing the true molecular activity with a pre-specified probability, say 70%, 90% or 95%. The challenge is that most machine learning (ML) algorithms that achieve superior predictive performance require some add-on methods for estimating uncertainty of their prediction. The development of these algorithms is an active area of research by statistical and ML communities but their implementation for QSAR modeling remains limited. Conformal prediction (CP) is a promising approach. It is agnostic to the prediction algorithm and can produce valid prediction intervals under some weak assumptions on the data distribution. We proposed computati...
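As an illustration of why conformal prediction is agnostic to the underlying model, the sketch below implements split (inductive) conformal regression with absolute residuals as the nonconformity score. This particular variant is an assumption chosen for brevity; the abstract is truncated before it describes the authors' own methods. The function name and the calibration numbers are hypothetical.

```python
import math

import numpy as np


def split_conformal_interval(cal_true, cal_pred, new_pred, coverage=0.9):
    """Prediction intervals from held-out calibration residuals.

    Works with any point predictor: only its predictions are needed.
    The half-width is the ceil((n + 1) * coverage)-th smallest absolute
    calibration residual, which gives the usual finite-sample guarantee
    under exchangeability.
    """
    scores = np.sort(np.abs(np.asarray(cal_true, float) - np.asarray(cal_pred, float)))
    n = scores.size
    rank = math.ceil((n + 1) * coverage)
    if rank > n:
        raise ValueError("calibration set too small for the requested coverage")
    half_width = scores[rank - 1]
    new_pred = np.asarray(new_pred, float)
    return new_pred - half_width, new_pred + half_width


# Hypothetical calibration data: true activities and model predictions.
cal_true = np.zeros(9)
cal_pred = np.arange(1.0, 10.0)        # absolute residuals 1, 2, ..., 9
lower, upper = split_conformal_interval(cal_true, cal_pred, [100.0], coverage=0.5)
print(lower, upper)
```

The interval has the same width for every compound in this variant; normalised nonconformity scores are a common refinement when prediction difficulty varies across molecules.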
Transformer-based interpretable multi-modal data fusion for skin lesion classification.
EN: A lot of deep learning (DL) research these days is mainly focused on improving quantitative metrics regardless of other factors. In human-centered applications, like skin lesion classification in dermatology, DL-driven clinical decision support systems are still in their infancy due to the limited transparency of their decision-making process. Moreover, the lack of procedures that can explain the behavior of trained DL algorithms leads to almost no trust from clinical physicians. To diagnose skin lesions, dermatologists rely on visual assessment of the disease and the data gathered from the patient's anamnesis. Data-driven algorithms dealing with multi-modal data are limited by the separation of feature-level and decision-level fusion procedures required by convolutional architectures. To address this issue, we enable single-stage multi-modal data fusion via the attention mechanism of transformer-based architectures to aid in diagnosing skin diseases. Our method beats other state-of-the-art single- and multi-modal DL architectures in image-rich and patient-data-rich environments. Additionally, the choice of the architecture enables native interpretability support for the classifica...
MATURE-HEALTH: HEALTH Recommender System for MAndatory FeaTURE choices.
EN: Balancing electrolytes is of utmost importance for the proper functioning of organs in the human body, as electrolyte imbalance can indicate the development of an underlying pathophysiology. Efficient monitoring of electrolyte imbalance not only increases the chances of early disease detection but also prevents further deterioration of health when a nutrient-controlled diet is strictly followed to balance electrolytes after detection. In this research, a recommender system, MATURE Health, is proposed and implemented; it predicts imbalances of mandatory electrolytes and other substances present in blood and recommends food items with balanced nutrients to avoid electrolyte imbalance. The proposed model takes the user's most recent laboratory results and daily food intake into account to predict electrolyte imbalance. MATURE Health relies on the MATURE Food algorithm to recommend food items, as the latter recommends only those food items that satisfy all mandatory nutrient requirements while also considering the user's past food preferences. To validate the proposed method, particularly sodium, potassium, and BUN levels have bee...
GPT-4 can pass the Korean National Licensing Examination for Korean Medicine Doctors.
EN: Traditional Korean medicine (TKM) emphasizes individualized diagnosis and treatment. This uniqueness makes AI modeling difficult due to limited data and implicit processes. Large language models (LLMs) have demonstrated impressive medical inference, even without advanced training in medical texts. This study assessed the capabilities of GPT-4 in TKM, using the Korean National Licensing Examination for Korean Medicine Doctors (K-NLEKMD) as a benchmark. The K-NLEKMD, administered by a national organization, encompasses 12 major subjects in TKM. We optimized prompts with Chinese-term annotation, English translation for questions and instruction, exam-optimized instruction, and self-consistency. GPT-4 with optimized prompts achieved 66.18% accuracy, surpassing both the examination's average pass mark of 60% and the 40% minimum for each subject. The gradual introduction of language-related prompts and prompting techniques enhanced the accuracy from 51.82% to its maximum accuracy. GPT-4 showed low accuracy in subjects including public health & medicine-related law, internal medicine (2) which are localized in Korea and TKM. The model's accuracy was lower for questions requiring TKM-speci...
HD-Bind: Encoding of Molecular Structure with Low Precision, Hyperdimensional Binary Representations.
EN: Publicly available collections of drug-like molecules have recently grown to comprise tens of billions of possibilities due to advances in chemical synthesis. Traditional methods for identifying "hit" molecules from a large collection of potential drug-like candidates have relied on biophysical theory to compute approximations to the Gibbs free energy of the binding interaction between a drug and its protein target. A major drawback of these approaches is that they require exceptional computing capabilities even for relatively small collections of molecules. Hyperdimensional Computing (HDC) is a recently proposed learning paradigm that leverages low-precision binary vector arithmetic to build efficient representations of the data, without the gradient-based optimization required by many conventional machine learning and deep learning approaches. This algorithmic simplicity allows for acceleration in hardware that has been previously demonstrated for a range of application areas. We consider existing HDC approaches for molecular property classification and introduce two novel encoding algorithms that l...
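The "low-precision binary vector arithmetic" behind HDC can be sketched generically: random binary hypervectors stand for tokens, a majority vote bundles them into one representation, and Hamming similarity compares representations, with no gradients involved. This is a generic illustration and not either of the paper's two encoding algorithms, which are cut off in the abstract. The dimension, the token names, and all function names are assumptions.

```python
import numpy as np

DIM = 10_000                          # hypervector length (assumed value)
rng = np.random.default_rng(0)
_item_memory = {}                     # token -> fixed random binary hypervector


def token_hv(token):
    """Return the random binary hypervector assigned to a token."""
    if token not in _item_memory:
        _item_memory[token] = rng.integers(0, 2, size=DIM, dtype=np.uint8)
    return _item_memory[token]


def bundle(hvs):
    """Combine hypervectors by bitwise majority vote (ties resolve to 0)."""
    stacked = np.stack(hvs)
    return (2 * stacked.sum(axis=0, dtype=np.int64) > len(hvs)).astype(np.uint8)


def encode(tokens):
    """Encode a bag of tokens (e.g. hypothetical molecular fragments)."""
    return bundle([token_hv(t) for t in tokens])


def hamming_similarity(a, b):
    """Fraction of positions where two binary hypervectors agree."""
    return 1.0 - np.count_nonzero(a != b) / a.size


molecule = encode(["C", "N", "O"])    # hypothetical token set
print(hamming_similarity(molecule, token_hv("C")))   # well above chance
print(hamming_similarity(molecule, token_hv("S")))   # close to 0.5 (chance)
```

A bundled vector stays measurably closer to each of its components than to unrelated vectors, which is what lets class prototypes be built by bundling training examples and queried by nearest Hamming distance.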
Materials Discovery with Extreme Properties via Reinforcement Learning-Guided Combinatorial Chemistry.
EN: The goal of most materials discovery is to discover materials that are superior to those currently known. Fundamentally, this is close to extrapolation, which is a weak point for most machine learning models that learn the probability distribution of data. Herein, we develop reinforcement learning-guided combinatorial chemistry, which is a rule-based molecular designer driven by trained policy for selecting subsequent molecular fragments to get a target molecule. Since our model has the potential to generate all possible molecular structures that can be obtained from combinations of molecular fragments, unknown molecules with superior properties can be discovered. We theoretically and empirically demonstrate that our model is more suitable for discovering better compounds than probability distribution-learning models. In an experiment aimed at discovering molecules that hit seven extreme target properties, our model discovered 1,315 of all target-hitting molecules and 7,629 of five target-hitting molecules out of 100,000 trials, whereas the probability distribution-learning models failed. Moreover, it has been confirmed that every molecule generated under the binding rules of molec...
FlexVDW: A machine learning approach to account for protein flexibility in ligand docking.
EN: Most widely used ligand docking methods assume a rigid protein structure. This leads to problems when the structure of the target protein deforms upon ligand binding. In particular, the ligand's true binding pose is often scored very unfavorably due to apparent clashes between ligand and protein atoms, which lead to extremely high values of the calculated van der Waals energy term. Traditionally, this problem has been addressed by explicitly searching for receptor conformations to account for the flexibility of the receptor in ligand binding. Here we present a deep learning model trained to take receptor flexibility into account implicitly when predicting van der Waals energy. We show that incorporating this machine-learned energy term into a state-of-the-art physics-based scoring function improves small molecule ligand pose prediction results in cases with substantial protein deformation, without degrading performance in cases with minimal protein deformation. This work demonstrates the feasibility of learning effects of protein flexibility on ligand binding without explicitly modeling changes in protein structure.
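To see why apparent clashes produce extremely high van der Waals energies, consider a standard 12-6 Lennard-Jones pair term. The abstract does not state which functional form the scoring function uses, so this form is an assumption, and the `epsilon` and `sigma` values are purely illustrative. The sketch shows only the rigid-receptor problem, not the learned energy term the paper proposes.

```python
def lennard_jones(r, epsilon=0.2, sigma=3.4):
    """12-6 Lennard-Jones energy for one atom pair at separation r.

    epsilon (well depth) and sigma (zero-crossing distance) are illustrative
    values, not parameters taken from the paper.
    """
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)


# Energy at the optimal separation, at r = sigma, and for a clash at half of sigma.
r_min = 2 ** (1 / 6) * 3.4
for r in (r_min, 3.4, 1.7):
    print(f"r = {r:.2f}  E = {lennard_jones(r):.2f}")
```

Because the repulsive term scales as the inverse twelfth power of the distance, halving the separation from `sigma` raises the energy by more than three orders of magnitude relative to the well depth. A correct pose placed into an undeformed receptor structure can therefore be scored as badly unfavorable.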
Preoperative Prognosis Assessment of Lumbar Spinal Surgery for Low Back Pain and Sciatica Patients based on Multimodalities and Multimodal Learning.
EN: Low back pain (LBP) and sciatica may require surgical therapy when they are symptomatic of severe pain. However, there are no effective measures to evaluate surgical outcomes in advance. This work combined elements of Eastern medicine and machine learning to develop a preoperative assessment tool that predicts the prognosis of lumbar spinal surgery in LBP and sciatica patients. Standard operative assessments, traditional Chinese medicine body constitution assessments, planned surgical approach, and vowel pronunciation recordings were collected and stored in different modalities. Our work provides insights into leveraging modality combinations, multimodal learning, and fusion strategies. The interpretability of the models and the correlations between modalities were also inspected. Based on the 105 recruited patients, we found that combining standard operative assessments, body constitution assessments, and planned surgical approach achieved the best performance, with an accuracy of 0.81. Our approach is effective and, owing to its simplicity, can be widely applied in general practice.
Interpretability from a new lens: Integrating Stratification and Domain knowledge for Biomedical Applications.
EN: The use of machine learning (ML) techniques in the biomedical field has become increasingly important, particularly with the large amounts of data generated by the aftermath of the COVID-19 pandemic. However, due to the complex nature of biomedical datasets and the use of black-box ML models, a lack of trust and adoption by domain experts can arise. In response, interpretable ML (IML) approaches have been developed, but the curse of dimensionality in biomedical datasets can lead to model instability. This paper proposes a novel computational strategy for the stratification of biomedical problem datasets into k-fold cross-validation (CVs) and integrating domain knowledge interpretation techniques embedded into the current state-of-the-art IML frameworks. This approach can improve model stability, establish trust, and provide explanations for outcomes generated by trained IML models. Specifically, the model outcome, such as aggregated feature weight importance, can be linked to further domain knowledge interpretations using techniques like pathway functional enrichment, drug targeting, and repurposing databases. Additionally, involving end-users and clinicians in focus group discussi...
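The idea of stratifying a dataset into k cross-validation folds can be sketched with the textbook label-stratified split. The abstract does not specify the paper's stratification procedure, so this is an assumed, generic version, and the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k, seed=0):
    """Split sample indices into k folds that each keep roughly the overall class balance."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a near-equal share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds
```

For example, with 80 negative and 20 positive samples and `k=5`, each fold receives 16 negatives and 4 positives. Every fold then sees the same class ratio, which is one plausible way such a scheme could reduce fold-to-fold variation in feature importances.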
Engineering long-range molecular potentials by external drive.
EN: We report the engineering of molecular potentials at large interatomic distances. The molecular states are generated by off-resonant optical coupling to a highly excited, long-range Rydberg molecular potential. The coupling produces a potential well in the low-lying molecular potential, which supports a bound state. The depth of the potential well, and thus the binding energy of the molecule, can be tuned by the coupling parameters. We characterize these molecules and find good agreement with a theoretical model based on the coupling of the two involved adiabatic potential energy curves. Our results open numerous possibilities to create long-range molecules between ultracold ground state atoms and to use them for ultracold chemistry and applications such as Feshbach resonances, Efimov physics or the study of halo molecules.
Securing Biomedical Images from Unauthorized Training with Anti-Learning Perturbation.
EN: The volume of open-source biomedical data has been essential to the development of various spheres of the healthcare community, since more "free" data can provide individual researchers more chances to contribute. However, institutions often hesitate to share their data with the public due to the risk of data exploitation by unauthorized third parties for other commercial uses (e.g., training AI models). This phenomenon might hinder the development of the whole healthcare research community. To address this concern, we propose a novel approach termed "unlearnable biomedical image" for protecting biomedical data by injecting imperceptible but delusive noise into the data, making it unexploitable for AI models. We formulate the problem as a bi-level optimization and propose three kinds of anti-learning perturbation generation approaches to solve the problem. Our method is an important step toward encouraging more institutions to contribute their data for the long-term development of the research community.
Hierarchical discriminative learning improves visual representations of biomedical microscopy.
EN: Learning high-quality, self-supervised, visual representations is essential to advance the role of computer vision in biomedical microscopy and clinical medicine. Previous work has focused on self-supervised representation learning (SSL) methods developed for instance discrimination and applied them directly to image patches, or fields-of-view, sampled from gigapixel whole-slide images (WSIs) used for cancer diagnosis. However, this strategy is limited because it (1) assumes patches from the same patient are independent, (2) neglects the patient-slide-patch hierarchy of clinical biomedical microscopy, and (3) requires strong data augmentations that can degrade downstream performance. Importantly, sampled patches from WSIs of a patient's tumor are a diverse set of image examples that capture the same underlying cancer diagnosis. This motivated HiDisc, a data-driven method that leverages the inherent patient-slide-patch hierarchy of clinical biomedical microscopy to define a hierarchical discriminative learning task that implicitly learns features of the underlying diagnosis. HiDisc uses a self-supervised contrastive learning framework in which positive patch pairs are defined based ...
Automatic Classification of Symmetry of Hemithoraces in Canine and Feline Radiographs.
EN: Purpose: Thoracic radiographs are commonly used to evaluate patients with confirmed or suspected thoracic pathology. Proper patient positioning is more challenging in canine and feline radiography than in humans due to less patient cooperation and body shape variation. Improper patient positioning during radiograph acquisition has the potential to lead to a misdiagnosis. Asymmetrical hemithoraces are one of the indications of obliquity for which we propose an automatic classification method. Approach: We propose a hemithoraces segmentation method based on Convolutional Neural Networks (CNNs) and active contours. We utilized the U-Net model to segment the ribs and spine and then utilized active contours to find left and right hemithoraces. We then extracted features from the left and right hemithoraces to train an ensemble classifier which includes Support Vector Machine, Gradient Boosting and Multi-Layer Perceptron. Five-fold cross-validation was used, thorax segmentation was evaluated by Intersection over Union (IoU), and symmetry classification was evaluated using Precision, Recall, Area under Curve and F1 score. Results: Classification of symmetry for 900 radiographs reporte...
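The Intersection over Union (IoU) metric used to evaluate the thorax segmentation can be written in a few lines. This is a minimal sketch on binary masks stored as nested lists. The function name and the convention for two empty masks are assumptions, and the paper's own evaluation code is not shown in the abstract.

```python
def iou(pred, target):
    """Intersection over Union of two same-shaped binary masks (nested lists of 0/1)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; define IoU as 1.0 to avoid dividing by zero.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(iou(pred, target))  # 2 shared pixels out of 4 covered pixels -> 0.5
```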
Dermatological Diagnosis Explainability Benchmark for Convolutional Neural Networks.
EN: In recent years, large strides have been taken in developing machine learning methods for dermatological applications, supported in part by the success of deep learning (DL). To date, diagnosing diseases from images is one of the most explored applications of DL within dermatology. Convolutional neural networks (ConvNets) are the most common DL method in medical imaging due to their training efficiency and accuracy, although they are often described as black boxes because of their limited explainability. One popular way to obtain insight into a ConvNet's decision mechanism is gradient class activation maps (Grad-CAM). A quantitative evaluation of the Grad-CAM explainability has been recently made possible by the release of DermXDB, a skin disease diagnosis explainability dataset which enables explainability benchmarking of ConvNet architectures. In this paper, we perform a literature review to identify the most common ConvNet architectures used for this task, and compare their Grad-CAM explanations with the explanation maps provided by DermXDB. We identified 11 architectures: DenseNet121, EfficientNet-B0, InceptionV3, InceptionResNetV2, MobileNet, MobileNetV2, NASNetMobile, ResNe...
SGMFQP: An Ontology-based Swine Gut Microbiota Federated Query Platform.
EN: Gut microbiota plays a crucial role in modulating pig development and health, and gut microbiota characteristics are associated with differences in feed efficiency. To answer open questions in feed efficiency analysis, biologists seek to retrieve information across multiple heterogeneous data sources. However, this is error-prone and time-consuming work since the queries can involve a sequence of multiple sub-queries over several databases. We present an implementation of an ontology-based Swine Gut Microbiota Federated Query Platform (SGMFQP) that provides a convenient, automated, and efficient query service about swine feeding and gut microbiota. The system is constructed based on a domain-specific Swine Gut Microbiota Ontology (SGMO), which facilitates the construction of queries independent of the actual organization of the data in the individual sources. This process is supported by a template-based query interface. A Datalog+-based federated query engine transforms the queries into sub-queries tailored for each individual data source, and an automated workflow orchestration mechanism executes the queries in each source database and consolidates the results. The efficiency of ...
CHA2: CHemistry Aware Convex Hull Autoencoder Towards Inverse Molecular Design.
EN: Optimizing molecular design and discovering novel chemical structures to meet certain objectives, such as quantitative estimates of the drug-likeness score (QEDs), is NP-hard due to the vast combinatorial design space of discrete molecular structures, which makes it nearly impossible to explore the entire search space comprehensively to exploit de novo structures with properties of interest. To address this challenge, reducing the intractable search space into a lower-dimensional latent volume helps examine molecular candidates more feasibly via inverse design. Autoencoders are suitable deep learning techniques, equipped with an encoder that reduces the discrete molecular structure into a latent space and a decoder that inverts the search space back to the molecular design. The continuous property of the latent space, which characterizes the discrete chemical structures, provides a flexible representation for inverse design in order to discover novel molecules. However, exploring this latent space requires certain insights to generate new structures. We propose using a convex hull surrounding the top molecules in terms of high QEDs to ensnare a tight subspace in the latent representa...
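One simple way to draw candidates from inside the convex hull of a set of top-scoring latent vectors is to take random convex combinations of them. The abstract is truncated before it explains how the hull is actually sampled, so the weighting scheme below is an assumption, and the function name is hypothetical. Decoding a sampled latent vector back to a molecule would require the trained decoder, which is not shown.

```python
import random


def sample_in_convex_hull(points, rng):
    """Return a random convex combination of the given latent vectors."""
    # Normalised exponential draws give Dirichlet(1, ..., 1) weights:
    # non-negative and summing to 1, so the result lies inside the hull.
    raw = [rng.expovariate(1.0) for _ in points]
    total = sum(raw)
    weights = [w / total for w in raw]
    dim = len(points[0])
    return [sum(w * p[d] for w, p in zip(weights, points)) for d in range(dim)]


# Toy 2-D "latent vectors" standing in for the encodings of top-QED molecules.
top_latents = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(sample_in_convex_hull(top_latents, random.Random(0)))
```

Every sample produced this way stays within the region spanned by the known high-scoring molecules, which is the "tight subspace" intuition in the abstract.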
'The Taurus': Cattle Breeds & Diseases Identification Mobile Application using Machine Learning.
EN: Dairy farming has played an important role in agriculture for thousands of years, not only in Sri Lanka but also in many other countries. Cattle are indispensable to dairy farming. According to literature surveys, almost 3.9 million cattle and calves die each year from various diseases. The causes of disease are mainly bacteria, parasites, fungi, and chemical poisons. Infectious diseases can be one of the greatest threats to livestock health. Cattle mortality causes considerable social, economic, and environmental damage. To reduce this negative impact, the proposal implements a cross-platform mobile application that analyzes and identifies the diseases cattle suffer from, suggests a solution, and identifies cattle breeds. The mobile application is designed to identify breeds by analyzing images of the cattle and to identify diseases by analyzing videos and images of the affected areas. A model is then built to identify the weight and age of a particular cow and suggest the best dose of medicine for the identified disease. This will be a huge advantage to farmers as well as to dairy industr...
Do Deep Learning Models Really Outperform Traditional Approaches in Molecular Docking?
EN: Molecular docking, which predicts the binding mode of a protein-ligand complex given a ligand molecule and a ligand binding site (called a "pocket") on a protein, is a widely used technique in drug design. Many deep learning models have been developed for molecular docking, but most of them perform docking on the whole protein rather than on a given pocket as traditional molecular docking approaches do, which does not match common needs. Moreover, they claim to perform better than traditional molecular docking, but the comparison is not fair, since traditional methods are not designed for docking on the whole protein without a given pocket. In this paper, we design a series of experiments to examine the actual performance of these deep learning models and traditional methods. For a fair comparison, we decompose the docking on the whole protein into two steps, pocket searching and docking on a given pocket, and build pipelines to evaluate traditional methods and deep learning methods respectively. We find that deep learning models are actually good at pocket searching, but traditional methods are better than deep learning models at docking on ...
PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding.
EN: Is there a unified model for generating molecules considering different conditions, such as binding pockets and chemical properties? Although target-aware generative models have made significant advances in drug design, they do not consider chemistry conditions and cannot guarantee the desired chemical properties. Unfortunately, merging the target-aware and chemical-aware models into a unified model to meet customized requirements may lead to the problem of negative transfer. Inspired by the success of multi-task learning in the NLP area, we use prefix embeddings to provide a novel generative model that considers both the targeted pocket's circumstances and a variety of chemical properties. All conditional information is represented as learnable features, which the generative model subsequently employs as a contextual prompt. Experiments show that our model exhibits good controllability in both single and multi-conditional molecular generation. The controllability enables us to outperform previous structure-based drug design methods. More interestingly, we open up the attention mechanism and reveal coupling relationships between conditions, providing guidance for multi-conditional ...
3D Molecular Generation via Virtual Dynamics.
EN: Structure-based drug design, i.e., finding molecules with high affinities to the target protein pocket, is one of the most critical tasks in drug discovery. Traditional solutions, like virtual screening, require exhaustively searching a large molecular database, which is inefficient and cannot return novel molecules beyond the database. The pocket-based 3D molecular generation model, i.e., directly generating a molecule with a 3D structure and binding position in the pocket, is a promising new way to address this issue. Herein, we propose VD-Gen, a novel pocket-based 3D molecular generation pipeline. VD-Gen consists of several carefully designed stages to generate fine-grained 3D molecules with binding positions in the pocket cavity end-to-end. Rather than directly generating or sampling atoms with 3D positions in the pocket as in early attempts, in VD-Gen we first randomly initialize many virtual particles in the pocket, then iteratively move these virtual particles so that their distribution approximates the distribution of molecular atoms. After the virtual particles are stabilized in 3D space, we extract a 3D molecule from them. Finally, we further refine a...
Protein-protein docking using a tensor train black-box optimization method.
EN: Black-box optimization methods play an important role in many fields of computational simulation. In particular, such methods are often used in the design and modelling of biological systems, including proteins and their complexes with various ligands. This work focuses mainly on protein-protein docking, which plays a key role in modern drug-design workflows. We develop a black-box approach for such docking problems using a novel technique based on the tensor-train decomposition of high-dimensional interaction functions. Our method shows an advantage in terms of the discovered global minima and has a high potential for further implementation on a wide range of devices, including graphical processing units and quantum processing units.
The big challenge for livestock genomics is to make sequence data pay.
EN: This paper will argue that one of the biggest challenges for livestock genomics is to make whole-genome sequencing and functional genomics applicable to breeding practice. It discusses potential explanations for why it is so difficult to consistently improve the accuracy of genomic prediction by means of whole-genome sequence data, and three potential attacks on the problem.
Predicting Molecule-Target Interaction by Learning Biomedical Network and Molecule Representations.
EN: The study of molecule-target interaction is important for drug discovery in terms of target identification, hit identification, pathway study, drug-drug interaction, etc. Most existing methodologies use either biomedical network information or molecule structural features to predict potential interaction links. However, methods based on biomedical network information usually suffer from the cold-start problem, while structure-based methods often give limited performance due to the structure/interaction assumption and data quality. To address these issues, we propose a pseudo-siamese Graph Neural Network method, namely MTINet+, which learns both biomedical network topological and molecule structural/chemical information as representations to predict the potential interaction of a given molecule and target pair. In MTINet+, 1-hop subgraphs of the given molecule and target pair are extracted from known interactions in the biomedical network as topological information, while the molecule's structural and chemical attributes are processed as molecule information. MTINet+ learns these two types of information as embedding features for predicting the pair link. In the experiments of different mo...
Exploring QSAR Models for Activity-Cliff Prediction.
EN: Pairs of similar compounds that only differ by a small structural modification but exhibit a large difference in their binding affinity for a given target are known as activity cliffs (ACs). It has been hypothesised that quantitative structure-activity relationship (QSAR) models struggle to predict ACs and that ACs thus form a major source of prediction error. However, a study to explore the AC-prediction power of modern QSAR methods and its relationship to general QSAR-prediction performance is lacking. We systematically construct nine distinct QSAR models by combining three molecular representation methods (extended-connectivity fingerprints, physicochemical-descriptor vectors and graph isomorphism networks) with three regression techniques (random forests, k-nearest neighbours and multilayer perceptrons); we then use each resulting model to classify pairs of similar compounds as ACs or non-ACs and to predict the activities of individual molecules in three case studies: dopamine receptor D2, factor Xa, and SARS-CoV-2 main protease. We observe low AC-sensitivity amongst the tested models when the activities of both compounds are unknown, but a substantial increase in AC-sensitivit...
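The activity-cliff definition in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the fingerprints are plain Python sets of bit indices, similarity is Tanimoto similarity, and the thresholds (similarity of at least 0.9, activity gap of at least 2 log units) and compound names are hypothetical.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two sets of fingerprint bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_activity_cliffs(compounds, sim_threshold=0.9, activity_gap=2.0):
    """Return name pairs that are structurally similar but differ strongly in activity.

    compounds maps a name to (fingerprint bit set, activity on a log scale).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    cliffs = []
    for (n1, (fp1, act1)), (n2, (fp2, act2)) in combinations(compounds.items(), 2):
        similar = tanimoto(fp1, fp2) >= sim_threshold
        large_gap = abs(act1 - act2) >= activity_gap
        if similar and large_gap:
            cliffs.append((n1, n2))
    return cliffs


# Hypothetical toy data: cpd_a and cpd_b share 19 of 21 bits but differ by 2.6 log units.
compounds = {
    "cpd_a": (set(range(0, 20)), 8.5),
    "cpd_b": (set(range(0, 19)) | {42}, 5.9),
    "cpd_c": (set(range(30, 50)), 6.0),
}

print(find_activity_cliffs(compounds))
```

In practice the fingerprints would come from a cheminformatics toolkit and the thresholds from the study's own definition; the sketch only shows the shape of the pairwise test.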
Few-Shot Learning Enables Population-Scale Analysis of Leaf Traits in Populus trichocarpa.
EN: Plant phenotyping is typically a time-consuming and expensive endeavor, requiring large groups of researchers to meticulously measure biologically relevant plant traits, and is the main bottleneck in understanding plant adaptation and the genetic architecture underlying complex traits at population scale. In this work, we address these challenges by leveraging few-shot learning with convolutional neural networks (CNNs) to segment the leaf body and visible venation of 2,906 P. trichocarpa leaf images obtained in the field. In contrast to previous methods, our approach (i) does not require experimental or image pre-processing, (ii) uses the raw RGB images at full resolution, and (iii) requires very few samples for training (e.g., just eight images for vein segmentation). Traits relating to leaf morphology and vein topology are extracted from the resulting segmentations using traditional open-source image-processing tools, validated using real-world physical measurements, and used to conduct a genome-wide association study to identify genes controlling the traits. In this way, the current work is designed to provide the plant phenotyping community with (i) methods for fast and accurat...
FE-TCM: Filter-Enhanced Transformer Click Model for Web Search.
EN: Constructing click models and extracting implicit relevance feedback from the interaction between users and search engines are very important for improving the ranking of search results. Using neural networks to model users' click behaviors has become one of the effective methods to construct click models. In this paper, we use a Transformer as the backbone network for feature extraction, add a filter layer, and propose a new Filter-Enhanced Transformer Click Model (FE-TCM) for web search. Firstly, in order to reduce the influence of noise on user behavior data, we use learnable filters to filter log noise. Secondly, following the examination hypothesis, we model the attraction estimator and examination predictor separately to output the attractiveness scores and examination probabilities. A novel transformer model is used to learn a deeper representation among different features. Finally, we apply combination functions to integrate attractiveness scores and examination probabilities into the click prediction. Our experiments on two real-world session datasets show that FE-TCM outperforms the existing click models for click prediction.
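The examination hypothesis mentioned in the abstract above is commonly written as P(click) = P(attracted) x P(examined). The sketch below shows only that classic product form as one possible combination function; it is an assumption for illustration, not FE-TCM's actual combination function, and the function name and inputs are hypothetical.

```python
def click_probability(attractiveness, examination):
    """Combine per-result attractiveness and examination probabilities by product.

    This is the textbook examination-hypothesis form, used here as an assumed
    stand-in for whatever combination function a given click model learns.
    """
    if len(attractiveness) != len(examination):
        raise ValueError("both lists must have one entry per search result")
    for p in list(attractiveness) + list(examination):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return [a * e for a, e in zip(attractiveness, examination)]


# Hypothetical scores for two results on one results page.
print(click_probability([0.8, 0.5], [1.0, 0.5]))
```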
Giving life to robotic skins.
EN: The skin of humanoid robots often lacks human tactility and the inherent self-repair capability of biological tissues. Recently, researchers have grown a living, self-healing skin on a robot finger by subsequent culturing of human dermal and epidermal cells. Here, we highlight the significance of this study alongside challenges toward developing biohybrid robots equipped with sensate and adaptive living robotic skins.
Plant species richness prediction from DESIS hyperspectral data: A comparison study on feature extraction procedures and regression models.
EN: The diversity of terrestrial vascular plants plays a key role in maintaining the stability and productivity of ecosystems. Airborne hyperspectral imaging has shown promise for measuring plant diversity remotely, but to operationalise these efforts over large regions we need to advance satellite-based alternatives. The advanced spectral and spatial specification of the recently launched DESIS (the DLR Earth Sensing Imaging Spectrometer) instrument provides a unique opportunity to test the potential for monitoring plant species diversity with spaceborne hyperspectral data. This study provides a quantitative assessment on the ability of DESIS hyperspectral data for predicting plant species richness in two different habitat types in southeast Australia. Spectral features were first extracted from the DESIS spectra, then regressed against on-ground estimates of plant species richness, with a two-fold cross validation scheme to assess the predictive performance. We tested and compared the effectiveness of Principal Component Analysis (PCA), Canonical Correlation Analysis (CCA), and Partial Least Squares analysis (PLS) for feature extraction, and Kernel Ridge Regression (KRR), Gaussian Pr...
Enhancement attacks in biomedical machine learning.
EN: The prevalence of machine learning in biomedical research is rapidly growing, yet the trustworthiness of such research is often overlooked. While some previous works have investigated the ability of adversarial attacks to degrade model performance in medical imaging, the ability to falsely improve performance via recently-developed "enhancement attacks" may be a greater threat to biomedical machine learning. In the spirit of developing attacks to better understand trustworthiness, we developed two techniques to drastically enhance prediction performance of classifiers with minimal changes to features: 1) general enhancement of prediction performance, and 2) enhancement of a particular method over another. Our enhancement framework falsely improved classifiers' accuracy from 50% to almost 100% while maintaining high feature similarities between original and enhanced data (Pearson's r's>0.99). Similarly, the method-specific enhancement framework was effective in falsely improving the performance of one method over another. For example, a simple neural network outperformed logistic regression by 17% on our enhanced dataset, although no performance differences were present in the origi...
Synthesis-driven design of 3D molecules for structure-based drug discovery using geometric transformers.
EN: Finding drug-like compounds with high bioactivity is essential for drug discovery, but the task is complicated by the high cost of chemical synthesis and validation. With their outstanding performance in de novo drug design, deep generative models represent promising tools for tackling this challenge. In recent years, 3D molecule generative models have gained increasing attention due to their ability to directly utilize the 3D interaction information between the target and ligand. However, it remains challenging to synthesize the molecules generated by these models, limiting the speed of bioactivity validation and further structure optimization. In this work, we propose DeepLigBuilder+, a deep generative model for 3D molecules that combines structure-based de novo drug design with a reaction-based generation framework. Besides producing 3D molecular structures, the model also proposes synthetic pathways for generated molecules, which greatly assists the retro-synthetic analysis. To achieve this, we developed a new way to enforce the synthesizability constraint using a tree-based organization of purchasable building blocks. This method enjoys high scalability and is compatible wit...
Novel Deep Learning Framework For Bovine Iris Segmentation.
EN: Iris segmentation is the initial step in identifying animal biometrics to establish a traceability system for livestock. In this study, we propose a novel deep learning framework for pixel-wise segmentation with minimal use of annotation labels, using the public BovineAAEyes80 dataset. In the experiment, U-Net with a VGG16 backbone was selected as the best combination of encoder and decoder model, demonstrating a 99.50% accuracy and a 98.35% Dice coefficient score. Remarkably, the selected model accurately segmented corrupted images even without proper annotation data. This study contributes to the advancement of iris segmentation and the development of a reliable DNN training framework.
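The two metrics reported in the abstract above, pixel accuracy and the Dice coefficient, can be computed as in the following minimal sketch. It assumes binary masks flattened into lists of 0/1 values; the function names and toy masks are hypothetical and this is not the authors' evaluation code.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly, so define Dice as 1.0 in that case.
    return 1.0 if total == 0 else 2 * intersection / total


def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target label."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    return sum(1 for p, t in zip(pred, target) if p == t) / len(pred)


# Hypothetical 4-pixel masks: one true positive, one false positive.
pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
print(dice_coefficient(pred, target), pixel_accuracy(pred, target))
```

The toy masks show why the two numbers can diverge: background pixels count toward accuracy but not toward Dice, so accuracy is typically the higher figure when the segmented region is small.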
Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing.
EN: There is increasing adoption of artificial intelligence in drug discovery. However, existing studies use machine learning to mainly utilize the chemical structures of molecules but ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions and predict complex biological activities. Here we present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct a large multi-modal dataset, namely, PubChemSTM, with over 280,000 chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions, including structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts across various benchmarks.
Can NLI Provide Proper Indirect Supervision for Low-resource Biomedical Relation Extraction?.
EN: Two key obstacles in biomedical relation extraction (RE) are the scarcity of annotations and the prevalence of instances without explicitly pre-defined labels due to low annotation coverage. Existing approaches, which treat biomedical RE as a multi-class classification task, often generalize poorly in low-resource settings and cannot make selective predictions on unknown cases, instead guessing among seen relations, which hinders their applicability. We present NBR, which converts biomedical RE into a natural language inference formulation through indirect supervision. By converting relations to natural language hypotheses, NBR is capable of exploiting semantic cues to alleviate annotation scarcity. By incorporating a ranking-based loss that implicitly calibrates abstinent instances, NBR learns a clearer decision boundary and is instructed to abstain on uncertain instances. Extensive experiments on three widely-used biomedical RE benchmarks, namely ChemProt, DDI and GAD, verify the effectiveness of NBR in both full-set and low-resource regimes. Our analysis demonstrates that indirect supervision benefits biomedical RE even when a domain gap ex...
Localising In-Domain Adaptation of Transformer-Based Biomedical Language Models.
EN: In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. In order to reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use-case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality d...
Dense Feature Memory Augmented Transformers for COVID-19 Vaccination Search Classification.
EN: With the devastating outbreak of COVID-19, vaccines are one of the crucial lines of defense against mass infection in this global pandemic. Given the protection they provide, vaccines are becoming mandatory in certain social and professional settings. This paper presents a classification model for detecting COVID-19 vaccination-related search queries, a machine learning model that is used to generate search insights for COVID-19 vaccinations. The proposed method combines modern state-of-the-art (SOTA) natural language understanding (NLU) techniques such as pretrained Transformers with traditional dense features. We propose a novel approach of treating dense features as memory tokens that the model can attend to. We show that this new modeling approach enables a significant improvement on the Vaccine Search Insights (VSI) task, improving on a strong, well-established gradient-boosting baseline with relative gains of +15% in F1 score and +14% in precision.
Malaria Parasitic Detection using a New Deep Boosted and Ensemble Learning Framework.
EN: Malaria is a potentially fatal disease caused by Plasmodium parasites, which are transmitted by female Anopheles mosquitoes and infect red blood cells; it affects millions of people worldwide every year. However, specialists' manual screening in clinical practice is laborious and prone to error. Therefore, a novel Deep Boosted and Ensemble Learning (DBEL) framework, comprising the stacking of new Boosted-BR-STM convolutional neural networks (CNN) and ensemble ML classifiers, is developed to screen malaria parasite images. The proposed Boosted-BR-STM is based on a new dilated-convolutional block-based split transform merge (STM) and feature-map Squeezing-Boosting (SB) ideas. Moreover, the new STM block uses regional and boundary operations to learn the malaria parasite's homogeneity, heterogeneity, and boundary with patterns. Furthermore, the diverse boosted channels are attained by employing Transfer Learning-based new feature-map SB in STM blocks at the abstract, medium, and conclusion levels to learn minute intensity and texture variation of the parasitic pattern. The proposed DBEL framework involves the stacking of prominent and diverse boosted channels and provides the generated discriminative features of the developed Boost...
Re-evaluating sample efficiency in de novo molecule generation.
EN: De novo molecule generation can suffer from data inefficiency, requiring large amounts of training data or many sampled data points to conduct objective optimization. The latter is a particular disadvantage when combining deep generative models with computationally expensive molecule scoring functions (a.k.a. oracles) commonly used in computer-aided drug design. Recent works have therefore focused on methods to improve sample efficiency in the context of de novo molecule drug design, or to benchmark it. In this work, we discuss and adapt a recent sample efficiency benchmark to better reflect realistic goals, including the quality of the chemistry generated, which must always be considered in the context of small-molecule drug design; we then re-evaluate all benchmarked generative models. We find that accounting for molecular weight and LogP with respect to the training data, and the diversity of chemistry proposed, re-orders the ranking of generative models. In addition, we benchmark a recently proposed method to improve sample efficiency (Augmented Hill-Climb) and find that it ranks top when considering both the sample efficiency and chemistry of molecules generated. Continual ...
DEL-Dock: Molecular Docking-Enabled Modeling of DNA-Encoded Libraries.
EN: DNA-Encoded Library (DEL) technology has enabled significant advances in hit identification by allowing efficient testing of combinatorially-generated molecular libraries. DEL screens measure protein binding affinity through sequencing reads of molecules tagged with unique DNA-barcodes that survive a series of selection experiments. Computational models have been deployed to learn the latent binding affinities that are correlated to the sequenced count data; however, this correlation is often obfuscated by various sources of noise introduced in its complicated data-generation process. In order to denoise DEL count data and screen for molecules with good binding affinity, computational models require the correct assumptions in their modeling structure to capture the correct signals underlying the data. Recent advances in DEL models have focused on probabilistic formulations of count data, but existing approaches have thus far been limited to only utilizing 2-D molecule-level representations. We introduce a new paradigm, DEL-Dock, that combines ligand-based descriptors with 3-D spatial information from docked protein-ligand complexes. 3-D spatial information allows our model to learn ...
Simple and Scalable Algorithms for Cluster-Aware Precision Medicine.
EN: AI-enabled precision medicine promises a transformational improvement in healthcare outcomes by enabling data-driven personalized diagnosis, prognosis, and treatment. However, the well-known "curse of dimensionality" and the clustered structure of biomedical data together interact to present a joint challenge in the high dimensional, limited observation precision medicine regime. To overcome both issues simultaneously we propose a simple and scalable approach to joint clustering and embedding that combines standard embedding methods with a convex clustering penalty in a modular way. This novel, cluster-aware embedding approach overcomes the complexity and limitations of current joint embedding and clustering methods, which we show with straightforward implementations of hierarchically clustered principal component analysis (PCA), locally linear embedding (LLE), and canonical correlation analysis (CCA). Through both numerical experiments and real-world examples, we demonstrate that our approach outperforms traditional and contemporary clustering methods on highly underdetermined problems (e.g., with just tens of observations) as well as on large sample datasets. Importantly, our app...
Reinforced Genetic Algorithm for Structure-based Drug Design.
EN: Structure-based drug design (SBDD) aims to discover drug candidates by finding molecules (ligands) that bind tightly to a disease-related protein (targets), which is the primary approach to computer-aided drug discovery. Recently, applying deep generative models for three-dimensional (3D) molecular design conditioned on protein pockets to solve SBDD has attracted much attention, but their formulation as probabilistic modeling often leads to unsatisfactory optimization performance. On the other hand, traditional combinatorial optimization methods such as genetic algorithms (GA) have demonstrated state-of-the-art performance in various molecular optimization tasks. However, they do not utilize protein target structure to inform design steps but rely on a random-walk-like exploration, which leads to unstable performance and no knowledge transfer between different tasks despite the similar binding physics. To achieve a more stable and efficient SBDD, we propose Reinforced Genetic Algorithm (RGA) that uses neural models to prioritize the profitable design steps and suppress random-walk behavior. The neural models take the 3D structure of the targets and ligands as inputs and are pre-tra...
Antibiotic-dependent instability of homeostatic plasticity for growth and environmental load.
EN: Reducing antibiotic usage in livestock animals has become an urgent issue worldwide to prevent antimicrobial resistance. Here, the effects of overusing chlortetracycline (CTC), a versatile antibacterial agent, on the performance, blood components, fecal microbiota, and organic acid concentrations of calves were investigated. Japanese Black calves were fed milk replacer containing CTC at 10 g/kg (CON) or 0 g/kg (EXP). Growth performance was not affected by CTC administration. However, CTC administration altered the correlation between fecal organic acids and bacterial genera. Machine learning methods such as association analysis, linear discriminant analysis, and energy landscape analysis revealed that CTC administration affected the populations of various types of fecal bacteria according to certain rules. It is particularly interesting that the population of several methane-producing bacteria was high in the CON group, and that of Lachnospiraceae, a butyrate-producing bacterial family, was high in the EXP group at 60 d of age. Furthermore, statistical causal inference based on machine learning data estimated that CTC treatment affects the entire intestinal environment, inhibiting butyrate production for growth and biol...
Detecting broken Absorber Tubes in CSP plants using intelligent sampling and dual loss.
EN: Concentrated solar power (CSP) is one of the growing technologies leading the transition from fossil fuels to renewable energies. The sophistication and size of the systems require an increase in maintenance tasks to ensure reliability, availability, maintainability and safety. Currently, automatic fault detection in CSP plants using Parabolic Trough Collector systems has two main drawbacks: 1) the devices in use need to be manually placed near the receiver tube, 2) the Machine Learning-based solutions are not tested in real plants. We address both gaps by combining the data extracted with the use of an Unmanned Aerial Vehicle, and the data provided by sensors placed within 7 real plants. The resulting dataset is the first one of this type and can help to standardize research activities for the problem of fault detection in this type of plant. Our work proposes supervised machine-learning algorithms for detecting broken envelopes of the absorber tubes in CSP plants. The proposed solution takes the class imbalance problem into account, boosting the accuracy of the algorithms for the minority class without harming the overall performance of the models. For a D...
Improving dermatology classifiers across populations using images generated by large diffusion models.
EN: Dermatological classification algorithms developed without sufficiently diverse training data may generalize poorly across populations. While intentional data collection and annotation offer the best means for improving representation, new computational approaches for generating training data may also aid in mitigating the effects of sampling bias. In this paper, we show that DALL·E 2, a large-scale text-to-image diffusion model, can produce photorealistic images of skin disease across skin types. Using the Fitzpatrick 17k dataset as a benchmark, we demonstrate that augmenting training data with DALL·E 2-generated synthetic images improves classification of skin disease overall and especially for underrepresented groups.
DiffBP: Generative Diffusion of 3D Molecules for Target Protein Binding.
EN: Generating molecules that bind to specific proteins is an important but challenging task in drug discovery. Previous works usually generate atoms in an auto-regressive way, where element types and 3D coordinates of atoms are generated one by one. However, in real-world molecular systems, the interactions among atoms in an entire molecule are global, leading to the energy function pair-coupled among atoms. With such energy-based consideration, the modeling of probability should be based on joint distributions, rather than sequentially conditional ones. Thus, the unnatural sequentially auto-regressive modeling of molecule generation is likely to violate the physical rules, thus resulting in poor properties of the generated molecules. In this work, a generative diffusion model for molecular 3D structures based on target proteins as contextual constraints is established, at a full-atom level in a non-autoregressive way. Given a designated 3D protein binding site, our model learns the generative process that denoises both element types and 3D coordinates of an entire molecule, with an equivariant network. Experimentally, the proposed method shows competitive performance compared with pr...
Detecting Conspiracy Theory Against COVID-19 Vaccines.
EN: Since the beginning of the vaccination trial, social media has been flooded with anti-vaccination comments and conspiracy beliefs. As the days pass, the number of COVID-19 cases increases, and online platforms and a few news portals entertain the sharing of different conspiracy theories. The most popular conspiracy beliefs were that the 5G network spreads COVID-19 and that the Chinese government spread the virus as a bioweapon, which initially created racial hatred. Although some false beliefs have little impact on society, others cause massive destruction. For example, the 5G conspiracy led to the burning of 5G towers, and belief in the Chinese bioweapon story promoted attacks on Asian Americans. Another popular conspiracy belief was that Bill Gates spread this Coronavirus disease (COVID-19) by launching a mass vaccination program to track everyone. This conspiracy belief creates distrust among laypeople and fuels vaccine hesitancy. This study aims to discover the conspiracy theory against the vaccine on social platforms. We performed a sentiment analysis on 598 unique sample comments related to COVID-19 vaccines. We used two different models, BERT and Perspectiv...
Drug-target affinity prediction method based on consistent expression of heterogeneous data.
EN: The first step in drug discovery is finding drug molecule moieties with medicinal activity against specific targets. Therefore, it is crucial to investigate the interaction between drug-target proteins and small chemical molecules. However, traditional experimental methods for discovering potential small drug molecules are labor-intensive and time-consuming. There is currently a lot of interest in building computational models to screen small drug molecules using drug molecule-related databases. In this paper, we propose a method for predicting drug-target binding affinity using deep learning models. This method uses a modified GRU and GNN to extract features from the drug-target protein sequences and the drug molecule map, respectively, to obtain their feature vectors. The combined vectors are used as vector representations of drug-target molecule pairs and then fed into a fully connected network to predict drug-target binding affinity. This proposed model demonstrates its accuracy and effectiveness in predicting drug-target binding affinity on the DAVIS and KIBA datasets.
Grid-based state space exploration for molecular binding.
EN: Binding processes are difficult to sample with molecular-dynamics (MD) simulations. In particular, the state space exploration is often incomplete. Evaluating the molecular interaction energy on a grid circumvents this problem but is heavily limited by state space dimensionality. Here, we make the first steps towards a low-dimensional grid-based model of molecular binding. We discretise the state space of relative positions and orientations of the two molecules under the rigid body assumption. The corresponding program is published as the Python package molgri. For the rotational component of the grids, we test algorithms based on Euler angles, polyhedra and quaternions, of which the polyhedra-based are the most uniform. The program outputs a sequence of molecular structures that can be easily processed by standard MD programs to calculate grid point energies. We demonstrate the grid-based approach on two molecular systems: a water dimer and a coiled-coil protein interacting with a chloride anion. For the second system we relax the rigid-body assumption and improve the accuracy of the grid point energies by an energy minimisation. In both cases, oriented bonding patterns and energie...
BioNLI: Generating a Biomedical NLI Dataset Using Lexico-semantic Constraints for Adversarial Examples.
EN: Natural language inference (NLI) is critical for complex decision-making in the biomedical domain. One key question, for example, is whether a given biomedical mechanism is supported by experimental evidence. This can be seen as an NLI problem, but there are no directly usable datasets to address it. The main challenge is that manually creating informative negative examples for this task is difficult and expensive. We introduce a novel semi-supervised procedure that bootstraps an NLI dataset from an existing biomedical dataset that pairs mechanisms with experimental evidence in abstracts. We generate a range of negative examples using nine strategies that manipulate the structure of the underlying mechanisms both with rules, e.g., flip the roles of the entities in the interaction, and, more importantly, as perturbations via logical constraints in a neuro-logical decoding system. We use this procedure to create a novel dataset for NLI in the biomedical domain, called BioNLI, and benchmark two state-of-the-art biomedical classifiers. The best result we obtain is in the mid-70s in F1, suggesting the difficulty of the task. Critically, the performance on the different classes of negative exam...
The development of food protein-inorganic hybrid nanoflowers with outstanding role in stabilizing natural pigments.
EN: Protein-inorganic hybrid nanoflowers (HNFs) possess unique properties in promoting surface reaction and have attracted wide-spread attention as a newly developed nanomaterial. However, the availability of protein sources has up to now been mostly limited to enzymes, which narrows the application of HNFs especially in food industry. Here we show that for many types of food protein, enzymatic hydrolysis can improve its ability to form versatile HNFs, or even induce HNF formation where the protein source did not show its formation a priori. The treatment of enzymatic hydrolysis increases the flexibility of such proteins and induces nucleation sites of HNFs in the early formation stage by decomposing those proteins into polypeptides. In particular, the HNF prepared with soy protein hydrolysate further shows a high loading capacity of water-soluble Monascus red, reaching up to 554.1 mg per gram of HNF. Its stabilization towards lipophilic curcumin is similarly impressive with the loading capacity reaching 21.9 mg per gram of HNF. This HNF could also effectively protect these two sensitive natural pigments in harsh environments. This research significantly broadens the available protein ...
PlanT: Explainable Planning Transformers via Object-Level Representations.
EN: Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3x faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 10 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. Our results indicate that PlanT can focus on the ...
Structure-based Drug Design with Equivariant Diffusion Models.
EN: Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. Generative SBDD methods leverage structural data of drugs in complex with their protein targets to propose new drug candidates. These approaches typically place one atom at a time in an autoregressive fashion using the binding pocket as well as previously added ligand atoms as context in each step. Recently a surge of diffusion generative models has entered this domain which hold promise to capture the statistical properties of natural ligands more faithfully. However, most existing methods focus exclusively on bottom-up de novo design of compounds or tackle other drug development challenges with task-specific models. The latter requires curation of suitable datasets, careful engineering of the models and retraining from scratch for each task. Here we show how a single pre-trained diffusion model can be applied to a broader range of problems, such as off-the-shelf property optimization, explicit negative design, and partial molecular design with inpainting. We formulate SBDD as a 3D-conditional generation problem and present DiffSB...
Structure-based drug design with geometric deep learning.
EN: Structure-based drug design uses three-dimensional geometric information of macromolecules, such as proteins or nucleic acids, to identify suitable ligands. Geometric deep learning, an emerging concept of neural-network-based machine learning, has been applied to macromolecular structures. This review provides an overview of the recent applications of geometric deep learning in bioorganic and medicinal chemistry, highlighting its potential for structure-based drug discovery and design. Emphasis is placed on molecular property prediction, ligand binding site and pose prediction, and structure-based de novo molecular design. The current challenges and opportunities are highlighted, and a forecast of the future of geometric deep learning for drug discovery is presented.
The Coming of Age of Nucleic Acid Vaccines during COVID-19.
EN: In the 21st century, several emergent viruses have posed a global threat. Each pathogen has emphasized the value of rapid and scalable vaccine development programs. The ongoing SARS-CoV-2 pandemic has made the importance of such efforts especially clear. New biotechnological advances in vaccinology allow for recent advances that provide only the nucleic acid building blocks of an antigen, eliminating many safety concerns. During the COVID-19 pandemic, these DNA and RNA vaccines have facilitated the development and deployment of vaccines at an unprecedented pace. This success was attributable at least in part to broader shifts in scientific research relative to prior epidemics; the genome of SARS-CoV-2 was available as early as January 2020, facilitating global efforts in the development of DNA and RNA vaccines within two weeks of the international community becoming aware of the new viral threat. Additionally, these technologies that were previously only theoretical are not only safe but also highly efficacious. Although historically a slow process, the rapid development of vaccines during the COVID-19 crisis reveals a major shift in vaccine technologies. Here, we provide historica...
E3Bind: An End-to-End Equivariant Network for Protein-Ligand Docking.
EN: In silico prediction of the ligand binding pose to a given protein target is a crucial but challenging task in drug discovery. This work focuses on blind flexible self-docking, where we aim to predict the positions, orientations and conformations of docked molecules. Traditional physics-based methods usually suffer from inaccurate scoring functions and high inference cost. Recently, data-driven methods based on deep learning techniques are attracting growing interest thanks to their efficiency during inference and promising performance. These methods usually either adopt a two-stage approach by first predicting the distances between proteins and ligands and then generating the final coordinates based on the predicted distances, or directly predict the global roto-translation of ligands. In this paper, we take a different route. Inspired by the resounding success of AlphaFold2 for protein structure prediction, we propose E3Bind, an end-to-end equivariant network that iteratively updates the ligand pose. E3Bind models the protein-ligand interaction through careful consideration of the geometric constraints in docking and the local context of the binding site. Experiments on standar...
Metabolic Model-based Ecological Modeling for Probiotic Design.
EN: The microbial community composition in the human gut has a profound effect on human health. This observation has led to extensive use of microbiome therapies, including over-the-counter "probiotic" treatments intended to alter the composition of the microbiome. Despite so much promise and commercial interest, the factors that contribute to the success or failure of microbiome-targeted treatments remain unclear. We investigate the biotic interactions that lead to successful engraftment of a novel bacterial strain introduced to the microbiome as in probiotic treatments. We use pairwise genome-scale metabolic modeling with a generalized resource allocation constraint to build a network of interactions between 818 species with well developed models available in the AGORA database. We create induced sub-graphs using the taxa present in samples from three experimental engraftment studies and assess the likelihood of invader engraftment based on network structure. To do so, we use a set of dynamical models designed to connect network topology to growth dynamics. We show that a generalized Lotka-Volterra model has strong ability to predict if a particular invader or probiotic wil...
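To make the generalized Lotka-Volterra (gLV) idea concrete, here is a minimal sketch that integrates dx_i/dt = x_i (r_i + Σ_j A_ij x_j) for one resident and one invader. The growth rates, interaction matrix, step size, and engraftment threshold are all illustrative assumptions, not values from the paper, and the paper's step of deriving interactions from genome-scale metabolic models is not shown.

```python
def simulate_glv(growth, interactions, x0, dt=0.01, steps=10_000):
    """Integrate dx_i/dt = x_i * (r_i + sum_j A_ij * x_j) with forward Euler."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        rates = [
            x[i] * (growth[i] + sum(interactions[i][j] * x[j] for j in range(n)))
            for i in range(n)
        ]
        # Clamp at zero so numerical error cannot produce negative abundances.
        x = [max(0.0, x[i] + dt * rates[i]) for i in range(n)]
    return x


# Hypothetical two-species system: index 0 is the resident, index 1 the invader.
growth = [1.0, 0.8]                          # intrinsic growth rates r_i (assumed)
interactions = [[-1.0, -0.5], [-0.4, -1.0]]  # interaction matrix A_ij (assumed)

# The resident starts at its carrying capacity; the invader starts rare.
final = simulate_glv(growth, interactions, x0=[1.0, 0.01])
invader_engrafted = final[1] > 0.1           # hypothetical engraftment threshold
```

With these assumed parameters the invader can grow when rare, because its growth rate exceeds the competitive pressure from the resident, so the simulation settles at a coexistence state rather than excluding the invader.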
DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking.
EN: Predicting the binding structure of a small molecule ligand to a protein -- a task known as molecular docking -- is critical to drug design. Recent deep learning methods that treat docking as a regression problem have decreased runtime compared to traditional search-based methods but have yet to offer substantial improvements in accuracy. We instead frame molecular docking as a generative modeling problem and develop DiffDock, a diffusion generative model over the non-Euclidean manifold of ligand poses. To do so, we map this manifold to the product space of the degrees of freedom (translational, rotational, and torsional) involved in docking and develop an efficient diffusion process on this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD < 2 Å) on PDBBind, significantly outperforming the previous state-of-the-art of traditional docking (23%) and deep learning (20%) methods. Moreover, while previous methods are not able to dock on computationally folded structures (maximum accuracy 10.4%), DiffDock maintains significantly higher precision (21.7%). Finally, DiffDock has fast inference times and provides confidence estimates with high selective accuracy.
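The top-1 success rate quoted above is the fraction of complexes whose top-ranked predicted pose lies within 2 Å RMSD of the reference pose. A minimal sketch of that metric follows. It assumes matched atom ordering and coordinates already in the protein frame, it ignores symmetry correction, and the function names and example values are hypothetical rather than taken from the DiffDock code.

```python
import math


def rmsd(pred, ref):
    """Root-mean-square deviation between two equally ordered lists of 3D points."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (p - q) ** 2 for a, b in zip(pred, ref) for p, q in zip(a, b)
    )
    return math.sqrt(squared / len(pred))


def top1_success_rate(top1_rmsds, threshold=2.0):
    """Fraction of complexes whose top-ranked pose has RMSD below `threshold` (Å)."""
    if not top1_rmsds:
        raise ValueError("need at least one complex")
    hits = sum(1 for value in top1_rmsds if value < threshold)
    return hits / len(top1_rmsds)


# Hypothetical two-atom ligand shifted by 1 Å along z relative to the reference.
shifted = rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)])

# Hypothetical top-1 RMSD values (Å) for five complexes; 2.0 itself is not a hit.
rate = top1_success_rate([0.8, 1.9, 2.0, 3.5, 7.2])
```

The strict inequality mirrors the "RMSD < 2 Å" criterion in the abstract, so a pose at exactly 2.0 Å counts as a miss.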
State-specific protein-ligand complex structure prediction with a multi-scale deep generative model.
EN: The binding complexes formed by proteins and small molecule ligands are ubiquitous and critical to life. Despite recent advancements in protein structure prediction, existing algorithms are so far unable to systematically predict the binding ligand structures along with their regulatory effects on protein folding. To address this discrepancy, we present NeuralPLexer, a computational approach that can directly predict protein-ligand complex structures solely using protein sequence and ligand molecular graph inputs. NeuralPLexer adopts a deep generative model to sample the 3D structures of the binding complex and their conformational changes at an atomistic resolution. The model is based on a diffusion process that incorporates essential biophysical constraints and a multi-scale geometric deep learning system to iteratively sample residue-level contact maps and all heavy-atom coordinates in a hierarchical manner. NeuralPLexer achieves state-of-the-art performance compared to all existing methods on benchmarks for both protein-ligand blind docking and flexible binding site structure recovery. Moreover, owing to its specificity in sampling both ligand-free-state and ligand-bound-state ...
SPICE, A Dataset of Drug-like Molecules and Peptides for Training Machine Learning Potentials.
EN: Machine learning potentials are an important tool for molecular simulation, but their development is held back by a shortage of high quality datasets to train them on. We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids. It includes 15 elements, charged and uncharged molecules, and a wide range of covalent and non-covalent interactions. It provides both forces and energies calculated at the ωB97M-D3(BJ)/def2-TZVPPD level of theory, along with other useful quantities such as multipole moments and bond orders. We train a set of machine learning potentials on it and demonstrate that they can achieve chemical accuracy across a broad region of chemical space. It can serve as a valuable resource for the creation of transferable, ready to use potential functions for use in molecular simulations.
ImDrug: A Benchmark for Deep Imbalanced Learning in AI-aided Drug Discovery.
EN: The last decade has witnessed a prosperous development of computational methods and dataset curation for AI-aided drug discovery (AIDD). However, real-world pharmaceutical datasets often exhibit highly imbalanced distribution, which is overlooked by the current literature but may severely compromise the fairness and generalization of machine learning applications. Motivated by this observation, we introduce ImDrug, a comprehensive benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline such as molecular modeling, drug-target interaction and retrosynthesis. We conduct extensive empirical studies with novel evaluation metrics, to demonstrate that the existing algorithms fall short of solving medicinal and pharmaceutical challenges in the data imbalance scenario. We believe that ImDrug opens up avenues for future research and development, on real-world challenges at the intersection of AIDD and deep imbalanced learning.
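One reason imbalance needs dedicated evaluation is that plain accuracy rewards a model for ignoring the minority class. The sketch below contrasts accuracy with balanced accuracy (the mean of per-class recall) on a hypothetical 95:5 label split. Balanced accuracy is a standard metric chosen here for illustration; it is not claimed to be one of the novel metrics ImDrug proposes, and none of these functions belong to the ImDrug library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(y_pred[i] == label for i in indices)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical screen: 95 inactive compounds (0) and 5 active compounds (1).
y_true = [0] * 95 + [1] * 5
# A degenerate classifier that always predicts "inactive".
y_pred = [0] * 100

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
```

Here the always-inactive classifier scores high on plain accuracy while its balanced accuracy is no better than chance, which is the kind of gap an imbalance-aware benchmark is meant to expose.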
Low energy electron interactions with resveratrol and resorcinol: anion states and likely dissociation pathways.
EN: We report a computational study of the anion states of the resveratrol (RV) and resorcinol (RS) molecules, also investigating dissociative electron attachment (DEA) pathways. RV has well known beneficial effects in human health, and its antioxidant activity was previously associated with DEA reactions producing H$_2$. Our calculations indicate a valence bound state ($\pi^*_1$) and four resonances ($\pi^*_2$ to $\pi^*_5$) for that system. While the computed thermodynamical thresholds are compatible with DEA reactions producing H$_2$ at 0 eV, the well known mechanism involving vibrational Feshbach resonances built on a dipole bound state should not be present in RV. Our results suggest that the shallow $\pi^*_1$ valence bound state is expected to account for H$_2$ elimination, probably involving $\pi^*_1$/$\sigma^*_{\text{OH}}$ couplings along the vibration dynamics. The RS molecule is also an oxidant and a subunit of RV. Since two close-lying hydroxyl groups are found in the RS moiety, the H$_2$-elimination reaction in RV should take place at the RS site. Our calculations point out a correspondence between the anion states of RV and RS, and even between the thresholds. Nevertheless, the absence of...
A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language.
EN: Although artificial intelligence (AI) has made significant progress in understanding molecules in a wide range of fields, existing models generally acquire a single cognitive ability from a single molecular modality. Since the hierarchy of molecular knowledge is profound, even humans learn from different modalities, including both intuitive diagrams and professional texts, to assist their understanding. Inspired by this, we propose a molecular multimodal foundation model which is pretrained from molecular graphs and their semantically related textual data (crawled from published Scientific Citation Index papers) via contrastive learning. This AI model represents a critical attempt that directly bridges molecular graphs and natural language. Importantly, through capturing the specific and complementary information of the two modalities, our proposed model can better grasp molecular expertise. Experimental results show that our model not only exhibits promising performance in cross-modal tasks such as cross-modal retrieval and molecule captioning, but also enhances molecular property prediction and possesses capability to generate meaningful molecular graphs from natural language desc...
On the Effectiveness of Compact Biomedical Transformers.
EN: Language models pre-trained on biomedical corpora, such as BioBERT, have recently shown promising results on downstream biomedical tasks. Many existing pre-trained models, on the other hand, are resource-intensive and computationally heavy owing to factors such as embedding size, hidden dimension, and number of layers. The natural language processing (NLP) community has developed numerous strategies to compress these models utilising techniques such as pruning, quantisation, and knowledge distillation, resulting in models that are considerably faster, smaller, and subsequently easier to use in practice. By the same token, in this paper we introduce six lightweight models, namely, BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT, and CompactBioBERT which are obtained either by knowledge distillation from a biomedical teacher or continual learning on the Pubmed dataset via the Masked Language Modelling (MLM) objective. We evaluate all of our models on three biomedical tasks and compare them with BioBERT-v1.1 to create efficient lightweight models that perform on par with their larger counterparts. All the models will be publicly available on our Huggingface profi...
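To make the knowledge-distillation idea in this abstract concrete, here is a minimal sketch of a temperature-softened distillation loss between teacher and student logits. The abstract does not specify which distillation losses the six models use, so the KL form, the `T**2` scaling, and the name `distillation_loss` are assumptions for illustration.

```python
import numpy as np


def log_softmax(logits, temperature):
    """Row-wise log-softmax of logits / temperature, computed stably."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper: an assumed, commonly used form of the loss,
    not necessarily the one used in the paper.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)
    return float(np.mean(kl) * temperature ** 2)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge.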
Plant Species Classification Using Transfer Learning by Pretrained Classifier VGG-19.
EN: Deep learning is currently the most important branch of machine learning, with applications in speech recognition, computer vision, image classification, and medical imaging analysis. Plant recognition is one of the areas where image classification can be used to identify plant species through their leaves. Botanists devote a significant amount of time to recognizing plant species by inspecting specimens in person. This paper describes a method for dissecting color images of Swedish leaves and identifying plant species. To achieve higher accuracy, the task is completed using transfer learning with the help of the pre-trained classifier VGG-19. The four primary processes of classification are image preprocessing, image augmentation, feature extraction, and recognition, which are performed as part of the overall model evaluation. The VGG-19 classifier grasps the characteristics of leaves by employing pre-defined hidden layers such as convolutional layers, max pooling layers, and fully connected layers, and finally uses the soft-max layer to generate a feature representation for all plant classes. The model obtains knowledge connected to aspects of the Swedish leaf dataset, which contains fifteen ...
SPIM-Flow, an integrated light-sheet and microfluidics platform for hydrodynamic studies of Hydra.
EN: Selective plane illumination microscopy (SPIM), or light sheet, is a powerful three-dimensional imaging approach. However, accessing such microscopes and interfacing them with microfluidics have remained challenging. Complex interfacing with microfluidics has limited SPIM's utility in studying the hydrodynamics of freely moving multicellular organisms. We developed SPIM-Flow, an inexpensive light sheet platform that enables easy integration with microfluidics. We used SPIM-Flow to study the hydrodynamics of freely moving Hydra polyps in millimeter-sized chambers (4 mm wide, 1.5 mm high). Our initial experiments across multiple animals, feeding on a chip (Artemia franciscana nauplius used as food), and baseline behaviors (e.g., tentacle swaying, elongation, and bending) indicated animals' health inside the system. SPIM enabled easy imaging of the freely moving animal and tracer beads (for fluid visualizations) inside the larger chambers. Next, using the chambers, we investigated Hydra's response to flow. Results suggest that animals responded to established flow by bending and swaying their tentacles in the flow direction. Finally, we used a previously described video analysis software (...
A survey, review, and future trends of skin lesion segmentation and classification.
EN: The Computer-aided Diagnosis or Detection (CAD) approach for skin lesion analysis is an emerging field of research that has the potential to alleviate the burden and cost of skin cancer screening. Researchers have recently indicated increasing interest in developing such CAD systems, with the intention of providing a user-friendly tool to dermatologists to reduce the challenges encountered or associated with manual inspection. This article aims to provide a comprehensive literature survey and review of a total of 594 publications (356 for skin lesion segmentation and 238 for skin lesion classification) published between 2011 and 2022. These articles are analyzed and summarized in a number of different ways to contribute vital information regarding the methods for the development of CAD systems. These ways include relevant and essential definitions and theories, input data (dataset utilization, preprocessing, augmentations, and fixing imbalance problems), method configuration (techniques, architectures, module frameworks, and losses), training tactics (hyperparameter settings), and evaluation criteria. We intend to investigate a variety of performance-enhancing approaches, including...
Federated Self-Supervised Contrastive Learning and Masked Autoencoder for Dermatological Disease Diagnosis.
EN: In dermatological disease diagnosis, the private data collected by mobile dermatology assistants exist on distributed mobile devices of patients. Federated learning (FL) can use decentralized data to train models while keeping data local. Existing FL methods assume all the data have labels. However, medical data often comes without full labels due to high labeling costs. Self-supervised learning (SSL) methods, contrastive learning (CL) and masked autoencoders (MAE), can leverage the unlabeled data to pre-train models, followed by fine-tuning with limited labels. However, combining SSL and FL has unique challenges. For example, CL requires diverse data but each device only has limited data. For MAE, while Vision Transformer (ViT) based MAE has higher accuracy over CNNs in centralized learning, MAE's performance in FL with unlabeled data has not been investigated. Besides, the ViT synchronization between the server and clients is different from traditional CNNs. Therefore, special synchronization methods need to be designed. In this work, we propose two federated self-supervised learning frameworks for dermatological disease diagnosis with limited labels. The first one features lower...
From Static to Dynamic Structures: Improving Binding Affinity Prediction with Graph-Based Deep Learning.
EN: Accurate prediction of protein-ligand binding affinities is an essential challenge in structure-based drug design. Despite recent advances in data-driven methods for affinity prediction, their accuracy is still limited, partially because they only take advantage of static crystal structures while the actual binding affinities are generally determined by the thermodynamic ensembles between proteins and ligands. One effective way to approximate such a thermodynamic ensemble is to use molecular dynamics (MD) simulation. Here, an MD dataset containing 3,218 different protein-ligand complexes is curated, and Dynaformer, a graph-based deep learning model is further developed to predict the binding affinities by learning the geometric characteristics of the protein-ligand interactions from the MD trajectories. In silico experiments demonstrated that the model exhibits state-of-the-art scoring and ranking power on the CASF-2016 benchmark dataset, outperforming the methods hitherto reported. Moreover, in a virtual screening on heat shock protein 90 (HSP90) using Dynaformer, 20 candidates are identified and their binding affinities are further experimentally validated. Dynaformer displayed p...
Harnessing the polymer-particle duality of ultra-soft nanogels to stabilise smart emulsions.
EN: Micro- and nanogels are widely used to stabilise emulsions and simultaneously make them responsive to external stimuli. One of the factors that improves the emulsion stability is the nanogel softness. Here, we study how the softest nanogels that can be synthesised with precipitation polymerisation of N-isopropylacrylamide (NIPAM), the ultra-low crosslinked (ULC) nanogels, stabilise oil-in-water emulsions. We show that ULC nanogels can efficiently stabilise emulsions already at low mass concentrations. These emulsions are resistant to droplet flocculation, stable against coalescence, and can be easily broken upon an increase in temperature. The resistance to flocculation of the ULC-stabilised emulsion droplets is similar to that of emulsions stabilised by linear pNIPAM. In contrast, the stability against coalescence and the temperature-responsiveness closely resemble those of emulsions stabilised by regularly crosslinked pNIPAM nanogels. The reason for this combination of properties is that ULC nanogels can be thought of as colloids in between flexible macromolecules and particles. As a polymer, ULC nanogels can efficiently stretch at the interface and cover it un...
Widely Used and Fast De Novo Drug Design by a Protein Sequence-Based Reinforcement Learning Model.
EN: De novo molecular design has facilitated the exploration of large chemical space to accelerate drug discovery. Structure-based de novo methods can overcome the data scarcity of active ligands by incorporating drug-target interaction into deep generative architectures. However, these strategies are bottlenecked by the small fraction of experimentally determined protein or complex structures. In addition, molecular generation is computationally expensive due to 3D representations of both molecule and protein. Here, we demonstrate a widely used and fast protein sequence-based reinforcement learning (RL) model for drug discovery. In the generative model, one of the reward components, a binding affinity predictor, is based on 1D protein sequence and molecular SMILES. As a proof of concept, the RL model was utilized to design molecules for four targets. The generated compounds showed bioactivities, as validated by both QSAR and molecular docking with experimental 3D binding pockets. We also found that the performance of generated molecules depends on the choice of training data source for the binding predictor. Furthermore, drug design for a kinase without any experimen...
Localization and Classification of Parasitic Eggs in Microscopic Images Using an EfficientDet Detector.
EN: Intestinal parasitic infections (IPIs) caused by protozoan and helminth parasites are among the most common infections in humans in low- and middle-income countries (LMICs). They are regarded as a severe public health concern, as they cause a wide array of potentially detrimental health conditions. Researchers have been developing pattern recognition techniques for the automatic identification of parasite eggs in microscopic images. Existing solutions still need improvements to reduce diagnostic errors and generate fast, efficient, and accurate results. Our paper addresses this and proposes a multi-modal learning detector to localize parasitic eggs and categorize them into 11 categories. The experiments were conducted on the novel Chula-ParasiteEgg-11 dataset, which was used to train both an EfficientDet model with an EfficientNet-v2 backbone and EfficientNet-B7+SVM. The dataset has 11,000 microscopic training images from 11 categories. Our results show robust performance with an accuracy of 92% and an F1 score of 93%. Additionally, the intersection-over-union (IoU) distribution illustrates the high localization capability of the detector.
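The localization metric mentioned in this abstract, intersection over union, can be sketched for two axis-aligned bounding boxes as follows. The box layout `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions for illustration; the paper's own evaluation code is not shown in the digest.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle (may be empty).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7.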
CIPCaD-Bench: Continuous Industrial Process datasets for benchmarking Causal Discovery methods.
EN: Causal relationships are commonly examined in manufacturing processes to support fault investigations, perform interventions, and make strategic decisions. Industry 4.0 has made available an increasing amount of data that enable data-driven Causal Discovery (CD). Considering the growing number of recently proposed CD methods, it is necessary to introduce strict benchmarking procedures on publicly available datasets since they represent the foundation for a fair comparison and validation of different methods. This work introduces two novel public datasets for CD in continuous manufacturing processes. The first dataset employs the well-known Tennessee Eastman simulator for fault detection and process control. The second dataset is extracted from an ultra-processed food manufacturing plant, and it includes a description of the plant, as well as multiple ground truths. These datasets are used to propose a benchmarking procedure based on different metrics and evaluated on a wide selection of CD algorithms. This work allows testing CD methods in realistic conditions enabling the selection of the most suitable method for specific target applications. The datasets are available at the fol...
VacciNet: Towards a Smart Framework for Learning the Distribution Chain Optimization of Vaccines for a Pandemic.
EN: Vaccination against viruses has long been an urgent need. However, it is hard to efficiently distribute the vaccines (on time) to all the corners of a country, especially during a pandemic. Considering the vastness of the population, diversified communities, and demands of a smart society, it is an important task to optimize the vaccine distribution strategy in any country/state effectively. Although there is a profusion of data (Big Data) from various vaccine administration sites that can be mined to gain valuable insights about mass vaccination drives, very few attempts have been made towards revolutionizing the traditional mass vaccination campaigns to mitigate the socio-economic crises of pandemic afflicted countries. In this paper, we bridge this gap in studies and experimentation. We collect daily vaccination data which is publicly available and carefully analyze it to generate meaningful insights and predictions. We put forward a novel framework leveraging Supervised Learning and Reinforcement Learning (RL) which we call VacciNet, that is capable of learning to predict the demand of vaccination in a state of a country as well as suggest optimal vac...
Dynamics and triggers of misinformation on vaccines.
EN: The Covid-19 pandemic has sparked renewed attention on the prevalence of misinformation online, whether intentional or not, underscoring the potential risks posed to individuals' quality of life associated with the dissemination of misconceptions and enduring myths on health-related subjects. In this study, we analyze 6 years (2016-2021) of Italian vaccine debate across diverse social media platforms (Facebook, Instagram, Twitter, YouTube), encompassing all major news sources - both questionable and reliable. We first use the symbolic transfer entropy analysis of news production time-series to dynamically determine which category of sources, questionable or reliable, causally drives the agenda on vaccines. Then, leveraging deep learning models capable of accurately classifying vaccine-related content based on the conveyed stance and discussed topic, respectively, we evaluate the focus on various topics by news sources promoting opposing views and compare the resulting user engagement. Aside from providing valuable resources for further investigation of vaccine-related misinformation, particularly in a language (Italian) that receives less attention in scientific research compared to l...
Selectivity in single-molecule reactions by tip-induced redox chemistry.
EN: Controlling selectivity of reactions is a quest in chemistry. Here, we demonstrate reversible and selective bond formation and dissociation promoted by tip-induced reduction-oxidation reactions on a surface. Molecular rearrangements leading to different constitutional isomers are selected by the polarity and magnitude of applied voltage pulses from the tip of a combined scanning tunneling and atomic force microscope. Characterization of voltage dependence of the reactions and determination of reaction rates demonstrate selectivity in constitutional isomerization reactions and provide insight into the underlying mechanisms. With support of density functional theory calculations, we find that the energy landscape of the isomers in different charge states is important to rationalize the selectivity. Tip-induced selective single-molecule reactions increase our understanding of redox chemistry and could lead to novel molecular machines.
Explainable AI (XAI) in Biomedical Signal and Image Processing: Promises and Challenges.
EN: Artificial intelligence has become pervasive across disciplines and fields, and biomedical image and signal processing is no exception. The growing and widespread interest in the topic has triggered a vast research activity that is reflected in an exponential research effort. Through study of massive and diverse biomedical data, machine and deep learning models have revolutionized various tasks such as modeling, segmentation, registration, classification and synthesis, outperforming traditional techniques. However, the difficulty in translating the results into biologically/clinically interpretable information is preventing their full exploitation in the field. Explainable AI (XAI) attempts to fill this translational gap by providing means to make the models interpretable and providing explanations. Different solutions have been proposed so far and are gaining increasing interest from the community. This paper aims at providing an overview of XAI in biomedical data processing and points to an upcoming Special Issue on Deep Learning in Biomedical Image and Signal Processing of the IEEE Signal Processing Magazine that is going to appear in March 2022.
Towards Transparency in Dermatology Image Datasets with Skin Tone Annotations by Experts, Crowds, and an Algorithm.
EN: While artificial intelligence (AI) holds promise for supporting healthcare providers and improving the accuracy of medical diagnoses, a lack of transparency in the composition of datasets exposes AI models to the possibility of unintentional and avoidable mistakes. In particular, public and private image datasets of dermatological conditions rarely include information on skin color. As a start towards increasing transparency, AI researchers have appropriated the use of the Fitzpatrick skin type (FST) from a measure of patient photosensitivity to a measure for estimating skin tone in algorithmic audits of computer vision applications including facial recognition and dermatology diagnosis. In order to understand the variability of estimated FST annotations on images, we compare several FST annotation methods on a diverse set of 460 images of skin conditions from both textbooks and online dermatology atlases. We find the inter-rater reliability between three board-certified dermatologists is comparable to the inter-rater reliability between the board-certified dermatologists and two crowdsourcing methods. In contrast, we find that the Individual Typology Angle converted to FST (ITA-FS...
Quantitative Assessment of DESIS Hyperspectral Data for Plant Biodiversity Estimation in Australia.
EN: Diversity of terrestrial plants plays a key role in maintaining a stable, healthy, and productive ecosystem. Though remote sensing has been seen as a promising and cost-effective proxy for estimating plant diversity, there is a lack of quantitative studies on how confidently plant diversity can be inferred from spaceborne hyperspectral data. In this study, we assessed the ability of hyperspectral data captured by the DLR Earth Sensing Imaging Spectrometer (DESIS) for estimating plant species richness in the Southern Tablelands and Snowy Mountains regions in southeast Australia. Spectral features were firstly extracted from DESIS spectra with principal component analysis, canonical correlation analysis, and partial least squares analysis. Then regression was conducted between the extracted features and plant species richness with ordinary least squares regression, kernel ridge regression, and Gaussian process regression. Results were assessed with the coefficient of correlation ($r$) and Root-Mean-Square Error (RMSE), based on a two-fold cross validation scheme. With the best performing model, $r$ is 0.71 and RMSE is 5.99 for the Southern Tablelands region, while $r$ is 0.62 and RMS...
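The evaluation described in this abstract (correlation coefficient $r$ and RMSE under two-fold cross validation) can be sketched with ordinary least squares as the regressor. How the paper splits the data and pools the folds is not stated in the digest, so the random split, the pooling of out-of-fold predictions, and the helper names `pearson_r`, `rmse`, and `two_fold_ols` are assumptions.

```python
import numpy as np


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two 1-D arrays."""
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    yt_c, yp_c = yt - yt.mean(), yp - yp.mean()
    return float((yt_c * yp_c).sum() / np.sqrt((yt_c ** 2).sum() * (yp_c ** 2).sum()))


def rmse(y_true, y_pred):
    """Root-mean-square error between two 1-D arrays."""
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((yt - yp) ** 2)))


def two_fold_ols(X, y, seed=0):
    """Fit OLS on one half, predict the other half, swap, then score pooled predictions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = np.random.default_rng(seed).permutation(len(y))
    half = len(y) // 2
    folds = [idx[:half], idx[half:]]
    preds = np.empty(len(y))
    for k in (0, 1):
        test, train = folds[k], folds[1 - k]
        # Prepend a column of ones for the intercept.
        a_train = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(a_train, y[train], rcond=None)
        a_test = np.column_stack([np.ones(len(test)), X[test]])
        preds[test] = a_test @ coef
    return pearson_r(y, preds), rmse(y, preds)
```

On noise-free linear data the sketch returns $r$ close to 1 and RMSE close to 0; on real richness data both would be computed from held-out predictions only, which is the point of the cross-validation scheme.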
Structural aspects of the clustering of curcumin molecules in water. Molecular dynamics computer simulation study.
EN: We explore clustering of curcumin molecules in water by using the OPLS-UA model for the enol conformer of curcumin (J. Mol. Liq., 223, 707, 2016) and the SPC-E water model. For this purpose, solutions of 2, 4, 8, 12, 16 and 20 curcumin molecules in 3000 water molecules are studied by using extensive molecular dynamics computer simulations. Radial distributions for the centers of mass of curcumin molecules are evaluated and the running coordination numbers are analyzed. The formation of clusters over time is elucidated. The internal structure of molecules within the cluster is described by using radial distributions of the elements of the curcumin molecule, the orientation descriptors, the order parameter and the radius of gyration. The self-diffusion coefficient of solute molecules in clusters is evaluated. The distribution of water species around clusters is described in detail. Our findings are compared with computer simulation results of other authors. The possibility of relating predictions of the model to experimental observations is discussed.
BOTAN: BOnd TArgeting Network for prediction of slow glassy dynamics by machine learning relative motion.
EN: Recent developments in machine learning have enabled accurate predictions of the dynamics of slow structural relaxation in glass-forming systems. However, existing machine-learning models for these tasks are mostly designed such that they learn a single dynamic quantity and relate it to the structural features of glassy liquids. In this study, we propose a graph neural network model, "BOnd TArgeting Network (BOTAN)", that learns relative motion between neighboring pairs of particles, in addition to the self-motion of particles. By relating the structural features to these two different dynamical variables, the model autonomously acquires the ability to discern how different dynamical processes, strain fluctuations and particle rearrangements, affect the self-motion of particles undergoing slow relaxation, and thus can predict with high precision how slow structural relaxation develops in space and time.
Inferring the stability of concentrated emulsions from droplet configuration information.
EN: When droplets are tightly packed in a 2D microchannel, coalescence of a pair of droplets can trigger an avalanche of coalescence events that propagate through the entire emulsion. This propagation is found to be stochastic, i.e. not every coalescence event triggers another. To study how the local probabilistic propagation affects the dynamics of the avalanche as a whole, a stochastic agent-based model is used. Taking as input i) how the droplets are packed (configuration) and ii) a measure of local probabilistic propagation (experimentally derived; function of fluid and other system parameters), the model predicts the expected size distribution of avalanches. In this article, we investigate how droplet configuration affects the avalanche dynamics. We find the mean size of these avalanches to depend non-trivially on how droplets are packed together. Large variations in the avalanche dynamics are observed when droplet packings differ, even when the other system properties (number of droplets, fluid properties, channel geometry, etc.) are kept constant. Bidisperse emulsions show less variation in the dynamics and they are surprisingly more stable than monodispe...
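A toy version of the stochastic agent-based model described in this abstract can be sketched as follows: droplets are nodes of a neighbour graph (the configuration), and each coalesced droplet triggers each untouched neighbour with a fixed probability `p` (the local propagation measure). The square-lattice packing, the single scalar `p`, and all function names are simplifying assumptions; the paper derives its propagation probability experimentally and studies other packings.

```python
import random
from collections import deque


def grid_neighbors(rows, cols):
    """Neighbour lists for droplets on a square lattice (an assumed packing)."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (r, c): [
            (r + dr, c + dc)
            for dr, dc in steps
            if 0 <= r + dr < rows and 0 <= c + dc < cols
        ]
        for r in range(rows)
        for c in range(cols)
    }


def avalanche_size(neighbors, start, p, rng):
    """Number of droplets reached when coalescence spreads from `start`."""
    merged = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in neighbors[node]:
            # Each coalesced droplet gets one chance to trigger each untouched neighbour.
            if nb not in merged and rng.random() < p:
                merged.add(nb)
                queue.append(nb)
    return len(merged)


def mean_avalanche_size(neighbors, p, trials=1000, seed=0):
    """Average avalanche size over random starting droplets."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    total = sum(avalanche_size(neighbors, rng.choice(nodes), p, rng) for _ in range(trials))
    return total / trials
```

Swapping `grid_neighbors` for a different neighbour graph while holding `p` fixed is the kind of comparison the abstract describes, where only the droplet configuration changes.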
"Double vaccinated, 5G boosted!": Learning Attitudes towards COVID-19 Vaccination from Social Media.
EN: To address the vaccine hesitancy which impairs the efforts of the COVID-19 vaccination campaign, it is imperative to understand public vaccination attitudes and track their changes in a timely manner. Despite their reliability and trustworthiness, conventional survey-based methods of collecting attitudes are time-consuming and expensive, and cannot follow the fast evolution of vaccination attitudes. We leverage the textual posts on social media to extract and track users' vaccination stances in near real time by proposing a deep learning framework. To address the impact of linguistic features such as sarcasm and irony commonly used in vaccine-related discourses, we integrate into the framework the recent posts of a user's social network neighbours to help detect the user's genuine attitude. Based on our annotated dataset from Twitter, the models instantiated from our framework can increase the performance of attitude extraction by up to 23% compared to state-of-the-art text-only models. Using this framework, we successfully validate the feasibility of using social media to track the evolution of vaccination attitudes in real life. We further show one practical use of our framework by validating the possi...
A fully differentiable ligand pose optimization framework guided by deep learning and traditional scoring functions.
EN: Machine learning (ML) and deep learning (DL) techniques are widely recognized to be powerful tools for virtual drug screening. The recently reported ML- or DL-based scoring functions have shown exciting performance in predicting protein-ligand binding affinities with fruitful application prospects. However, differentiating between highly similar ligand conformations, including the native binding pose (the global energy minimum state), remains challenging, although solving it could greatly enhance docking. In this work, we propose a fully differentiable framework for ligand pose optimization based on a hybrid scoring function (SF) that combines a multi-layer perceptron (DeepRMSD) with the traditional AutoDock Vina SF. The DeepRMSD+Vina, which combines (1) the root mean square deviation (RMSD) of the docking pose with respect to the native pose and (2) the AutoDock Vina score, is fully differentiable and is thus capable of optimizing the ligand binding pose to the lowest-energy conformation. Evaluated by the CASF-2016 docking power dataset, the DeepRMSD+Vina reaches a success rate of 95.4%, which is by far the best reported SF to date. Based on this SF, an end-to-end ligand pose optimization...
A Multilingual Dataset of COVID-19 Vaccination Attitudes on Twitter.
EN: Vaccine hesitancy is considered one of the main causes of the stagnant uptake of COVID-19 vaccines in Europe and the US, where vaccines are sufficiently supplied. A fast and accurate grasp of public attitudes toward vaccination is critical to addressing vaccine hesitancy, and social media platforms have proved to be an effective source of public opinion. In this paper, we describe the collection and release of a dataset of tweets related to COVID-19 vaccines. This dataset consists of the IDs of 2,198,090 tweets collected from Western Europe, 17,934 of which are annotated with the originators' vaccination stances. Our annotation will facilitate using and developing data-driven models to extract vaccination attitudes from social media posts and thus further confirm the power of social media in public health surveillance. To lay the groundwork for future research, we not only perform statistical analysis and visualisation of our dataset, but also evaluate and compare the performance of established text-based benchmarks in vaccination stance extraction. We demonstrate one potential practical use of our data in tracking the temporal changes of public COVID-19 vaccination attitudes.
HyGNN: Drug-Drug Interaction Prediction via Hypergraph Neural Network.
EN: Drug-Drug Interactions (DDIs) may hamper the functionalities of drugs, and in the worst scenario, they may lead to adverse drug reactions (ADRs). Predicting all DDIs is a challenging and critical problem. Most existing computational models integrate drug-centric information from different sources and leverage it as features in machine learning classifiers to predict DDIs. However, these models have a high chance of failure, especially for new drugs for which not all of this information is available. This paper proposes a novel Hypergraph Neural Network (HyGNN) model for the DDI prediction problem that is based only on the SMILES strings of drugs, which are available for any drug. To capture the drug similarities, we create a hypergraph from drugs' chemical substructures extracted from the SMILES strings. Then, we develop HyGNN, consisting of a novel attention-based hypergraph edge encoder to get the representation of drugs as hyperedges and a decoder to predict the interactions between drug pairs. Furthermore, we conduct extensive experiments to evaluate our model and compare it with several state-of-the-art methods. Experimental results demonstrate that our proposed HyGNN model effectively predicts DDI...
Stability versus Meta-stability in a Skin Microbiome Model.
EN: The skin microbiome plays an important role in the maintenance of healthy skin. It is an ecosystem composed of several species that compete for resources and interact with the skin cells. Imbalance in the cutaneous microbiome, also called dysbiosis, has been correlated with several skin conditions, including acne and atopic dermatitis. Generally, dysbiosis is linked to colonization of the skin by a population of opportunistic pathogenic bacteria (for example C. acnes in acne or S. aureus in atopic dermatitis). Treatments consisting of non-specific elimination of cutaneous microflora have shown conflicting results. It is therefore necessary to understand the factors influencing shifts of the skin microbiome composition. In this work, we introduce a mathematical model based on ordinary differential equations, with two types of bacterial populations (skin commensals and opportunistic pathogens), to study the mechanisms driving the dominance of one population over the other. By using published experimental data, assumed to correspond to the observation of stable states in our model, we derive constraints that allow us to reduce the number of parameters of the model from 13 to 5. Intere...
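As a rough illustration of how a two-population ODE model can settle into different stable states, the sketch below integrates a generic competitive Lotka-Volterra system with forward Euler. This is not the model from the paper: the equations, the function name `simulate_competition`, and every parameter value are assumptions chosen only to show bistability between a "commensal" and a "pathogen" population.

```python
from typing import Tuple


def simulate_competition(
    x0: float,
    y0: float,
    a_xy: float = 1.5,   # assumed competitive pressure of y on x
    a_yx: float = 1.5,   # assumed competitive pressure of x on y
    r: float = 1.0,      # assumed common growth rate
    dt: float = 0.01,
    steps: int = 5000,
) -> Tuple[float, float]:
    """Forward-Euler integration of a generic two-species competition model.

    dx/dt = r * x * (1 - x - a_xy * y)
    dy/dt = r * y * (1 - y - a_yx * x)
    Populations are expressed as fractions of a carrying capacity of 1.
    """
    x, y = x0, y0
    for _ in range(steps):
        dx = r * x * (1.0 - x - a_xy * y)
        dy = r * y * (1.0 - y - a_yx * x)
        x, y = x + dt * dx, y + dt * dy
    return x, y


# Hypothetical starting point where commensals slightly outnumber pathogens.
commensal, pathogen = simulate_competition(0.6, 0.4)
# Swapping the initial conditions swaps which population ends up dominant.
commensal_b, pathogen_b = simulate_competition(0.4, 0.6)
```

With strong mutual competition (both coefficients above 1), the long-run winner in this toy system depends only on which population starts ahead, which is the kind of initial-condition sensitivity a stability analysis is meant to characterize.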
Constrained Submodular Optimization for Vaccine Design.
EN: Advances in machine learning have enabled the prediction of immune system responses to prophylactic and therapeutic vaccines. However, the engineering task of designing vaccines remains a challenge. In particular, the genetic variability of the human immune system makes it difficult to design peptide vaccines that provide widespread immunity in vaccinated populations. We introduce a framework for evaluating and designing peptide vaccines that uses probabilistic machine learning models, and demonstrate its ability to produce designs for a SARS-CoV-2 vaccine that outperform previous designs. We provide a theoretical analysis of the approximability, scalability, and complexity of our framework.
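To give a feel for the kind of optimization the title refers to, here is a minimal sketch of greedy maximization of a monotone submodular coverage objective under a cardinality constraint. It is not the authors' framework: the peptide names, the genotype sets each peptide is assumed to cover, and the function `greedy_max_coverage` are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def greedy_max_coverage(candidates: Dict[str, Set[str]], budget: int) -> List[str]:
    """Pick up to `budget` candidates, each time taking the largest marginal gain."""
    chosen: List[str] = []
    covered: Set[str] = set()
    for _ in range(budget):
        best: Optional[str] = None
        best_gain = 0
        for name in sorted(candidates):  # sorted for deterministic tie-breaking
            if name in chosen:
                continue
            gain = len(candidates[name] - covered)  # newly covered items only
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no remaining candidate adds coverage
            break
        chosen.append(best)
        covered |= candidates[best]
    return chosen


# Hypothetical toy data: which genotypes each peptide is assumed to cover.
toy = {
    "pepA": {"g1", "g2", "g3"},
    "pepB": {"g3", "g4"},
    "pepC": {"g4", "g5", "g6", "g7"},
    "pepD": {"g1", "g7"},
}
selection = greedy_max_coverage(toy, budget=2)
```

For monotone submodular objectives under a cardinality constraint, this greedy rule is known to achieve at least a (1 - 1/e) fraction of the optimal value, which is one reason submodularity is attractive for design problems of this type.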
Neural interval-censored survival regression with feature selection.
EN: Survival analysis is a fundamental area of focus in biomedical research, particularly in the context of personalized medicine. This prominence is due to the increasing prevalence of large and high-dimensional datasets, such as omics and medical image data. However, the literature on non-linear regression algorithms and variable selection techniques for interval-censoring is either limited or non-existent, particularly in the context of neural networks. Our objective is to introduce a novel predictive framework tailored for interval-censored regression tasks, rooted in Accelerated Failure Time (AFT) models. Our strategy comprises two key components: i) a variable selection phase leveraging recent advances on sparse neural network architectures, ii) a regression model targeting prediction of the interval-censored response. To assess the performance of our novel algorithm, we conducted a comprehensive evaluation through both numerical experiments and real-world applications that encompass scenarios related to diabetes and physical activity. Our results outperform traditional AFT algorithms, particularly in scenarios featuring non-linear relationships.
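The likelihood behind interval-censored AFT regression can be sketched briefly. The example below assumes a log-normal AFT model, log T = mu + sigma * eps with standard normal eps, so an event known only to lie in (left, right] contributes F(right) - F(left) to the likelihood. The log-normal choice, the function names, and the toy inputs are assumptions for illustration; in the paper the location would come from a neural network rather than from fixed numbers.

```python
import math
from typing import Sequence, Tuple


def std_normal_cdf(z: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def cdf_event_time(t: float, mu: float, sigma: float) -> float:
    """P(T <= t) under the assumed log-normal AFT model."""
    if t <= 0.0:
        return 0.0  # left endpoint 0 means left-censored
    if math.isinf(t):
        return 1.0  # right endpoint inf means right-censored
    return std_normal_cdf((math.log(t) - mu) / sigma)


def interval_censored_loglik(
    intervals: Sequence[Tuple[float, float]],
    mus: Sequence[float],
    sigma: float,
) -> float:
    """Sum of log[F(right) - F(left)] over all observations."""
    total = 0.0
    for (left, right), mu in zip(intervals, mus):
        prob = cdf_event_time(right, mu, sigma) - cdf_event_time(left, mu, sigma)
        total += math.log(max(prob, 1e-300))  # guard against log(0)
    return total


# Hypothetical observations: one interval (0, 1] and one fully uninformative interval.
ll_half = interval_censored_loglik([(0.0, 1.0)], [0.0], 1.0)
ll_all = interval_censored_loglik([(0.0, math.inf)], [0.0], 1.0)
```

Maximizing this quantity over the parameters that produce `mus` and `sigma` is the usual fitting objective; a sparsity penalty for variable selection would be added on top of it.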
SHREC 2022: Protein-ligand binding site recognition.
EN: This paper presents the methods that participated in the SHREC 2022 contest on protein-ligand binding site recognition. The prediction of protein-ligand binding regions is an active research domain in computational biophysics and structural biology and plays a relevant role in molecular docking and drug design. The goal of the contest is to assess the effectiveness of computational methods in recognizing ligand binding sites in a protein based on its geometrical structure. The performance of the segmentation algorithms is analyzed according to two evaluation scores describing the capacity of a putative pocket to contact a ligand and to pinpoint the correct binding region. Although some methods perform remarkably well, we show that simple non-machine-learning approaches remain very competitive against data-driven algorithms. In general, pocket detection remains a challenging learning problem that suffers from intrinsic difficulties due to the lack of negative examples (data imbalance problem).
Machine learning assisted droplet trajectories extraction in dense emulsions and their analysis.
EN: This work analyzes trajectories obtained by YOLO and DeepSORT algorithms of dense emulsion systems simulated by Lattice Boltzmann methods. The results indicate that the individual droplet's moving direction is influenced more by the droplets immediately behind it than the droplets in front of it. The analysis also provides hints on constraints on writing down a dynamical model of droplets for the dense emulsion in narrow channels.
A perspective on the current state-of-the-art of quantum computing for drug discovery applications.
EN: Computational chemistry is an essential tool in the pharmaceutical industry. Quantum computing is a fast-evolving technology that promises to completely shift the computational capabilities in many areas of chemical research by bringing currently impossible calculations within reach. This perspective illustrates the near-future applicability of quantum computation to pharmaceutical problems. We briefly summarize and compare the scaling properties of state-of-the-art quantum algorithms, and provide novel estimates of the quantum computational cost of simulating progressively larger embedding regions of a pharmaceutically relevant covalent protein-drug complex involving the drug Ibrutinib. Carrying out these calculations requires an error-corrected quantum architecture, which we describe. Our estimates show that recent developments in quantum algorithms have dramatically reduced the quantum resources needed to run fully quantum calculations in active spaces of around 50 orbitals and electrons, from an estimated more than 1000 years using the Trotterisation approach to just a few days with sparse qubitisation, painting a picture of fast and exciting progress in this nascent field.
A Survey on Deep Learning for Skin Lesion Segmentation.
EN: Skin cancer is a major public health problem that could benefit from computer-aided diagnosis to reduce the burden of this common disease. Skin lesion segmentation from images is an important step toward achieving this goal. However, the presence of natural and artificial artifacts (e.g., hair and air bubbles), intrinsic factors (e.g., lesion shape and contrast), and variations in image acquisition conditions make skin lesion segmentation a challenging task. Recently, various researchers have explored the applicability of deep learning models to skin lesion segmentation. In this survey, we cross-examine 177 research papers that deal with deep learning-based segmentation of skin lesions. We analyze these works along several dimensions, including input data (datasets, preprocessing, and synthetic data generation), model design (architecture, modules, and losses), and evaluation aspects (data annotation requirements and segmentation performance). We discuss these dimensions both from the viewpoint of select seminal works, and from a systematic viewpoint, examining how those choices have influenced current trends, and how their limitations should be addressed. To facilitate comparisons...
Effective drug combination for Caenorhabditis elegans nematodes discovered by output-driven feedback system control technique.
EN: Infections from parasitic nematodes (or roundworms) contribute to a significant disease burden and productivity losses for humans and livestock. The limited number of anthelmintics (or antinematode drugs) available today to treat these infections are rapidly losing their efficacy as multidrug resistance in parasites becomes a global health challenge. We propose an engineering approach to discover an anthelmintic drug combination that is more potent at killing wild-type Caenorhabditis elegans worms than four individual drugs. In the experiment, freely swimming single worms are enclosed in microfluidic drug environments to assess the centroid velocity and track curvature of worm movements. After the behavioral data are analyzed in every iteration, the feedback system control (FSC) scheme is used to predict new drug combinations to test. Through a differential evolutionary search, a winning drug combination is reached that produces minimal centroid velocity and high track curvature, while requiring each drug at less than its EC50 concentration. The FSC approach is model-less and does not need any information on the drug pharmacology, signaling pathways, or animal biology. Toward com...
Estimating Waning of Vaccine Effectiveness: a Simulation Study.
EN: Developing accurate and reliable methods to estimate vaccine protection is a key goal in immunology and public health. While several statistical methods have been proposed, their potential inaccuracy in capturing fast intra-seasonal waning of vaccine-induced protection needs to be rigorously investigated. To compare statistical methods for vaccine effectiveness (VE) estimation, we generated simulated data using a multiscale agent-based model of an epidemic with an acute viral infection and differing extents of VE waning. We extended the previously proposed framework for VE measures based on the observational data richness to assess changes of vaccine-induced protection with time. While VE measures based on hard-to-collect information (e.g. exact timing of exposures) were accurate, usually VE studies rely on time-to-infection data and the Cox proportional hazard model. We found that its extension utilizing scaled Schoenfeld residuals, previously proposed for capturing VE waning, was unreliable in capturing both the degree of waning and its functional form and identified the mathematical factors contributing to this unreliability. We showed that partitioning time and including a time...
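One simple way to see what waning looks like in data is to compute vaccine effectiveness separately in successive time windows, using VE = 1 - (incidence rate in vaccinated / incidence rate in unvaccinated). The sketch below does only that; it is not the estimator studied in the paper, and the function name `ve_by_window` and the case counts and person-time values are hypothetical.

```python
from typing import List, Sequence, Tuple

# (cases_vaccinated, person_time_vaccinated, cases_unvaccinated, person_time_unvaccinated)
Window = Tuple[int, float, int, float]


def ve_by_window(windows: Sequence[Window]) -> List[float]:
    """Return VE = 1 - incidence rate ratio for each time window.

    Assumes every window has non-zero person-time and at least one
    unvaccinated case, otherwise the ratio is undefined.
    """
    estimates: List[float] = []
    for cases_v, pt_v, cases_u, pt_u in windows:
        rate_v = cases_v / pt_v
        rate_u = cases_u / pt_u
        estimates.append(1.0 - rate_v / rate_u)
    return estimates


# Hypothetical counts for three consecutive windows after vaccination.
toy_windows: List[Window] = [
    (10, 1000.0, 50, 1000.0),
    (20, 1000.0, 50, 1000.0),
    (30, 1000.0, 50, 1000.0),
]
ve = ve_by_window(toy_windows)  # roughly 0.8, 0.6, 0.4: protection declines over time
```

Window-wise estimates like these are crude, since they ignore confounding and depletion of susceptibles, but they make the notion of a time-varying VE concrete.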
Ferrocene as an iconic redox marker: from solution chemistry to molecular electronic devices.
EN: Since its discovery in 1951, ferrocene has been extensively exploited as a redox probe in a variety of processes ranging from solution chemistry, medicinal chemistry, supramolecular chemistry, and surface chemistry to solid-state molecular electronic and spintronic circuit elements to unravel electrochemical charge-transfer dynamics. Ferrocene is an extremely chemically and thermally stable, highly reproducible redox probe that undergoes reversible one-electron oxidation and reduction at the electrode/ferrocene solution interface in response to applied anodic and cathodic potentials, respectively. Almost 70 years after its discovery, it has become one of the most widely studied model organometallic compounds, not only for probing electrochemical charge-transfer processes but also as a molecular building block for the synthesis of chiral organometallic catalysts, potential drug candidates, polymeric compounds, and electrochemical sensors, to name a few. Ferrocene and its derivatives have been a breakthrough in many aspects due to their versatile reactivity, fascinating chemical structures, unconventional metal-ligand coordination, and the magic number of...
Slim: interoperable slide microscopy viewer and annotation tool for imaging data science and computational pathology.
EN: The exchange of large and complex slide microscopy imaging data in biomedical research and pathology practice is impeded by a lack of data standardization and interoperability, which is detrimental to the reproducibility of scientific findings and clinical integration of technological innovations. Slim is an open-source, web-based slide microscopy viewer that implements the internationally accepted Digital Imaging and Communications in Medicine (DICOM) standard to achieve interoperability with a multitude of existing medical imaging systems. We showcase the capabilities of Slim as the slide microscopy viewer of the NCI Imaging Data Commons and demonstrate how the viewer enables interactive visualization of traditional brightfield microscopy and highly-multiplexed immunofluorescence microscopy images from The Cancer Genome Atlas and Human Tissue Atlas Network, respectively, using standard DICOMweb services. We further show how Slim enables the collection of standardized image annotations for the development or validation of machine learning models and the visual interpretation of model inference results in the form of segmentation masks, spatial heat maps, or image-derived measureme...
Non-additivities of the particle sizes hidden in model pair potentials and their effects on physical adsorptions.
EN: Understanding the mechanism by which colloidal particles assemble near a substrate is important for the development of batteries, heterogeneous catalysts, paints, and cosmetics. Knowledge of this mechanism is also important for the crystallization of colloidal particles and proteins. In this study, we calculated the physical adsorption of small and large colloidal particles on a flat wall using integral equation theory, with electric double layer potentials as the pair potentials. In some cases, the calculations showed that the small particles are adsorbed more easily. This result is unusual from the viewpoint of the Asakura-Oosawa theory, and we call it the "reversal phenomenon". We then investigated the mechanism of the reversal phenomenon and found that it originates from the non-additivities of the particle sizes. In addition, we devised a method to analyze the non-additivity in the pair potentials. The method will be useful for checking various simulation results and for developing force fields for simulations of colloidal particles and proteins.
HelixADMET: a robust and endpoint extensible ADMET system incorporating self-supervised knowledge transfer.
EN: Accurate ADMET (an abbreviation for "absorption, distribution, metabolism, excretion, and toxicity") predictions can efficiently screen out undesirable drug candidates in the early stage of drug discovery. In recent years, multiple comprehensive ADMET systems that adopt advanced machine learning models have been developed, providing services to estimate multiple endpoints. However, those ADMET systems usually suffer from weak extrapolation ability. First, due to the lack of labelled data for each endpoint, typical machine learning models perform poorly on molecules with unobserved scaffolds. Second, most systems only provide fixed built-in endpoints and cannot be customised to satisfy various research requirements. To this end, we develop a robust and endpoint extensible ADMET system, HelixADMET (H-ADMET). H-ADMET incorporates the concept of self-supervised learning to produce a robust pre-trained model. The model is then fine-tuned with a multi-task and multi-stage framework to transfer knowledge between ADMET endpoints, auxiliary tasks, and self-supervised tasks. Our results demonstrate that H-ADMET achieves an overall improvement of 4%, compared with existing ADMET systems o...
Collaborative Drug Discovery: Inference-level Data Protection Perspective.
EN: The pharmaceutical industry can better leverage its data assets to virtualize drug discovery through a collaborative machine learning platform. On the other hand, there are non-negligible risks stemming from the unintended leakage of participants' training data; hence, it is essential for such a platform to be secure and privacy-preserving. This paper describes a privacy risk assessment for collaborative modeling in the preclinical phase of drug discovery to accelerate the selection of promising drug candidates. After a short taxonomy of state-of-the-art inference attacks, we adopt several and customize them to the underlying scenario. Finally, we describe and experiment with a handful of relevant privacy protection techniques to mitigate such attacks.
SuMe: A Dataset Towards Summarizing Biomedical Mechanisms.
EN: Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 22k instances. We also introduce conclusion sentence generation as a pretraining task with 611k instances. We benchmark the performance of large bio-domain language models. We find that while the pretr...
We the Droplets: A Constitutional Approach to Active and Self-Propelled Emulsions.
EN: The field of active matter, and particularly active emulsions, is growing rapidly, with significant progress made recently on both theoretical and experimental fronts. Here, we summarize experimental research progress related to active droplets. The constitution of active droplets, in particular the chemical compositions and structure of interfaces, is critical. We discuss how emulsion properties such as mechanism of motion, speed, trajectory, interaction strength, and lifetime are related to the droplet composition. We consider not only traditional single emulsions but also more complex variants, such as Janus droplets, Pickering emulsions, and multiple emulsions. Active behavior of isolated droplets as well as pairwise and multibody interactions between droplets is described. The influence of physical barriers that shape the local chemical gradients and fluid flow is also highlighted. This review provides perspective on past, current, and promising future experimental research directions in active droplet research.
CAVES: A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines.
EN: Convincing people to get vaccinated against COVID-19 is a key societal challenge at present. As a first step towards this goal, many prior works have relied on social media analysis to understand the specific concerns that people have towards these vaccines, such as potential side-effects, ineffectiveness, political factors, and so on. Though there are datasets that broadly classify social media posts into Anti-vax and Pro-Vax labels, there is no dataset (to our knowledge) that labels social media posts according to the specific anti-vaccine concerns mentioned in the posts. In this paper, we have curated CAVES, the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled into various specific anti-vaccine concerns in a multi-label setting. This is also the first multi-label classification dataset that provides explanations for each of the labels. Additionally, the dataset provides class-wise summaries of all the tweets. We also perform preliminary experiments on the dataset and show that this is a very challenging dataset for multi-label explainable classification and tweet summarization, as is evident from the moderate scores achieved by so...
Infusing Linguistic Knowledge of SMILES into Chemical Language Models.
EN: The simplified molecular-input line-entry system (SMILES) is the most popular representation of chemical compounds. Therefore, many SMILES-based molecular property prediction models have been developed. In particular, transformer-based models show promising performance because the model utilizes a massive chemical dataset for self-supervised learning. However, there is no transformer-based model to overcome the inherent limitations of SMILES, which result from the generation process of SMILES. In this study, we grammatically parsed SMILES to obtain connectivity between substructures and their type, which is called the grammatical knowledge of SMILES. First, we pretrained the transformers with substructural tokens, which were parsed from SMILES. Then, we used the training strategy 'same compound model' to better understand SMILES grammar. In addition, we injected knowledge of connectivity and type into the transformer with knowledge adapters. As a result, our representation model outperformed previous compound representations for the prediction of molecular properties. Finally, we analyzed the attention of the transformer model and adapters, demonstrating that the proposed model und...
Generating 3D Molecules for Target Protein Binding.
EN: A fundamental problem in drug discovery is to design molecules that bind to specific proteins. To tackle this problem using machine learning methods, here we propose a novel and effective framework, known as GraphBP, to generate 3D molecules that bind to given proteins by placing atoms of specific types at specific locations in the given binding site one by one. In particular, at each step, we first employ a 3D graph neural network to obtain geometry-aware and chemically informative representations from the intermediate contextual information. Such context includes the given binding site and atoms placed in the previous steps. Second, to preserve the desirable equivariance property, we select a local reference atom according to the designed auxiliary classifiers and then construct a local spherical coordinate system. Finally, to place a new atom, we generate its atom type and relative location w.r.t. the constructed local coordinate system via a flow model. We also consider generating the variables of interest sequentially to capture the underlying dependencies among them. Experiments demonstrate that our GraphBP is effective at generating 3D molecules with binding ability to target protein ...
Accurate ADMET Prediction with XGBoost.
EN: The absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are important in drug discovery as they define efficacy and safety. In this work, we applied an ensemble of features, including fingerprints and descriptors, and a tree-based machine learning model, extreme gradient boosting, for accurate ADMET prediction. Our model performs well in the Therapeutics Data Commons ADMET benchmark group. For 22 tasks, our model is ranked first in 18 tasks and top 3 in 21 tasks. The trained machine learning models are integrated in ADMETboost, a web server that is publicly available at https://ai-druglab.smu.edu/admet.
Characterizing metastable states with the help of machine learning.
EN: Present-day atomistic simulations generate long trajectories of ever more complex systems. Analyzing these data, discovering metastable states, and uncovering their nature is becoming increasingly challenging. In this paper, we first use the variational approach to conformation dynamics to discover the slowest dynamical modes of the simulations. This allows the different metastable states of the system to be located and organized hierarchically. The physical descriptors that characterize metastable states are discovered by means of a machine learning method. We show in the cases of two proteins, Chignolin and Bovine Pancreatic Trypsin Inhibitor, how such analysis can be effortlessly performed in a matter of seconds. Another strength of our approach is that it can be applied to the analysis of both unbiased and biased simulations.
MedDistant19: Towards an Accurate Benchmark for Broad-Coverage Biomedical Relation Extraction.
EN: Relation extraction in the biomedical domain is challenging due to the lack of labeled data and high annotation costs, since annotation requires domain experts. Distant supervision is commonly used to tackle the scarcity of annotated data by automatically pairing knowledge graph relationships with raw texts. Such a pipeline is prone to noise and faces added challenges when scaled to cover a large number of biomedical concepts. We investigated existing broad-coverage distantly supervised biomedical relation extraction benchmarks and found a significant overlap between training and test relationships ranging from 26% to 86%. Furthermore, we noticed several inconsistencies in the data construction process of these benchmarks, and where there is no train-test leakage, the focus is on interactions between narrower entity types. This work presents a more accurate benchmark, MedDistant19, for broad-coverage distantly supervised biomedical relation extraction that addresses these shortcomings and is obtained by aligning the MEDLINE abstracts with the widely used SNOMED Clinical Terms knowledge base. Because thorough evaluation with domain-specific language models has been lacking, we also conduct experiments validating general...
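The train-test overlap mentioned in this abstract is easy to measure once relationships are represented as (head, relation, tail) triples. The sketch below shows one plausible way to compute it as the share of distinct test triples that also occur in training. The function name `overlap_fraction`, the triple layout, and the toy triples are assumptions, and the paper may define overlap differently.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def overlap_fraction(train: Iterable[Triple], test: Iterable[Triple]) -> float:
    """Fraction of distinct test triples that also appear in the training set."""
    train_set = set(train)
    test_set = set(test)
    if not test_set:
        return 0.0
    return len(test_set & train_set) / len(test_set)


# Hypothetical toy triples, for illustration only.
train_triples = [
    ("aspirin", "may_treat", "headache"),
    ("insulin", "may_treat", "diabetes"),
    ("smoking", "causes", "lung cancer"),
]
test_triples = [
    ("aspirin", "may_treat", "headache"),   # leaked from training
    ("metformin", "may_treat", "diabetes"),
    ("asbestos", "causes", "mesothelioma"),
    ("statin", "may_prevent", "stroke"),
]
leak = overlap_fraction(train_triples, test_triples)
```

A check of this kind, run before training, flags benchmarks where a model could score well by memorizing triples rather than by reading the supporting text.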
In-Pocket 3D Graphs Enhance Ligand-Target Compatibility in Generative Small-Molecule Creation.
EN: Proteins in complex with small molecule ligands represent the core of structure-based drug discovery. However, three-dimensional representations are absent from most deep-learning-based generative models. We here present a graph-based generative modeling technology that encodes explicit 3D protein-ligand contacts within a relational graph architecture. The models combine a conditional variational autoencoder that allows for activity-specific molecule generation with putative contact generation that provides predictions of molecular interactions within the target binding pocket. We show that molecules generated with our 3D procedure are more compatible with the binding pocket of the dopamine D2 receptor than those produced by a comparable ligand-based 2D generative method, as measured by docking scores, expected stereochemistry, and recoverability in commercial chemical databases. Predicted protein-ligand contacts were found among highest-ranked docking poses with a high recovery rate. This work shows how the structural context of a protein target can be used to enhance molecule generation.
Modeling COVID-19 vaccine-induced immunological memory development and its links to antibody level and infectiousness.
EN: COVID-19 vaccines have proven to be effective against SARS-CoV-2 infection. However, the dynamics of vaccine-induced immunological memory development and neutralizing antibody generation are not fully understood, limiting vaccine development and vaccination regimen determination. Herein, we constructed a mathematical model to characterize the vaccine-induced immune response based on fitting the viral infection and vaccination datasets. With the example of CoronaVac, we revealed the association between vaccine-induced immunological memory development and neutralizing antibody levels. The establishment of intact immunological memory requires more than 6 months after the first and second doses, after which a booster shot can induce high levels of neutralizing antibodies. By introducing the maximum viral load and recovery time after viral infection, we quantitatively studied the protective effect of vaccines against viral infection. Accordingly, we optimized the vaccination regimen, including dose and vaccination timing, and predicted the effect of the fourth dose. Last, by combining the viral transmission model, we showed the suppression of virus transmission by vaccination, which m...
Dynamic of Single Molecules in Collective Light-Matter States from First Principles.
EN: The coherent interaction of a large collection of molecules with a common photonic mode results in strong light-matter coupling, a feature that has proved highly beneficial for chemistry and gave rise to the research topics of polaritonic and QED chemistry. Considering complex microscopic chemical reactions in combination with a macroscopic number of molecules renders existing ab initio approaches inapplicable. In this work, I introduce a simple approach to capture the collective nature while retaining the full ab initio representation of single molecules. By embedding the majority of the molecular ensemble into the dyadic Green tensor, we obtain a computationally cheap and intuitive description of the dynamics of a single molecule in the ensemble, an approach that seems ideal for polaritonic chemistry. The introduced embedding radiation-reaction potential is thoroughly discussed, including prospects, applications and limitations. A first application demonstrates the linear response of single molecules that are part of a larger ensemble of molecules. Then, by virtue of a simple proton-tunneling model, I illustrate that the influence of collective strong coupling on chemical reactions features ...
GrowliFlower: An image time series dataset for GROWth analysis of cauLIFLOWER.
EN: This article presents GrowliFlower, a georeferenced, image-based UAV time series dataset of two monitored cauliflower fields of size 0.39 and 0.60 ha acquired in 2020 and 2021. The dataset contains RGB and multispectral orthophotos from which about 14,000 individual plant coordinates are derived and provided. The coordinates enable dataset users to extract complete and incomplete time series of image patches showing individual plants. The dataset contains collected phenotypic traits of 740 plants, including the developmental stage as well as plant and cauliflower size. As the harvestable product is completely covered by leaves, plant IDs and coordinates are provided to extract image pairs of plants before and after defoliation, to facilitate estimations of cauliflower head size. Moreover, the dataset contains pixel-accurate leaf and plant instance segmentations, as well as stem annotations to address tasks like classification, detection, segmentation, instance segmentation, and similar computer vision tasks. The dataset aims to foster the development and evaluation of machine learning approaches. It specifically focuses on the analysis of growth and development of cauliflowe...
SELFIES and the future of molecular string representations.
EN: Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings -- most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. T...
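To make the SMILES validity problem concrete, here is a toy sketch that draws random strings from a few SMILES symbols and counts how many pass a very loose syntax check. The checker is an assumption for illustration, not a SMILES parser: it only tests that the string starts with an atom letter, that parentheses balance, and that single-digit ring-closure labels come in pairs. Real validity also depends on valence, aromaticity, and more, so the true failure rate is higher than this check suggests.

```python
import random


def toy_smiles_syntax_ok(s):
    """Loose syntax check: atom first, balanced branches, paired ring digits."""
    if not s or not s[0].isalpha():
        return False
    depth = 0
    open_rings = set()
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing a branch that was never opened
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens a ring, second closes it
    return depth == 0 and not open_rings


def fraction_passing(n_samples=1000, length=8, alphabet="CNO()=12", seed=0):
    """Share of random symbol strings that pass the toy check."""
    rng = random.Random(seed)
    passed = sum(
        toy_smiles_syntax_ok("".join(rng.choice(alphabet) for _ in range(length)))
        for _ in range(n_samples)
    )
    return passed / n_samples


print(toy_smiles_syntax_ok("C1CCCCC1"))  # True: the ring label 1 is opened and closed
print(toy_smiles_syntax_ok("C1CC("))     # False: open ring and open branch
print(fraction_passing())
```

SELFIES sidesteps this by construction, since every string of its symbols decodes to some molecule; the sketch above only shows why an unconstrained string alphabet does not have that property.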
On the Origins of Life's Homochirality: Inducing Enantiomeric Excess with Spin-Polarized Electrons.
EN: Life as we know it is homochiral, but the origins of biological homochirality on early Earth remain elusive. Shallow closed-basin lakes are a plausible prebiotic environment on early Earth, and most are expected to have significant sedimentary magnetite deposits. We hypothesize that UV (200-300 nm) irradiation of magnetite deposits could generate hydrated spin-polarized electrons sufficient to induce chirally selective prebiotic chemistry. Such electrons are potent reducing agents that drive reduction reactions where the spin polarization direction can enantioselectively alter the reaction kinetics. Our estimate of this chiral bias is based on the strong effective spin-orbit coupling observed in the chiral-induced spin selectivity (CISS) effect, as applied to energy differences in reduction reactions for different isomers. In the original CISS experiments, spin selective electron transmission through a monolayer of dsDNA molecules is observed at room temperature, indicating a strong coupling between molecular chirality and electron spin. We propose that the chiral symmetry breaking due to the CISS effect, when applied to reduction chemistry, can induce enantioselective synthesis on...
Entity-driven Fact-aware Abstractive Summarization of Biomedical Literature.
EN: As part of the large number of scientific articles being published every year, the publication rate of biomedical literature has been increasing. Consequently, there has been considerable effort to harness and summarize the massive amount of biomedical research articles. While transformer-based encoder-decoder models in a vanilla source document-to-summary setting have been extensively studied for abstractive summarization in different domains, their major limitations continue to be entity hallucination (a phenomenon where generated summaries contain entities not related to or present in source article(s)) and factual inconsistency. This problem is exacerbated in a biomedical setting where named entities and their semantics (which can be captured through a knowledge base) constitute the essence of an article. The use of named entities and facts mined from background knowledge bases pertaining to the named entities to guide abstractive summarization has not been studied in biomedical article summarization literature. In this paper, we propose an entity-driven fact-aware framework for training end-to-end transformer-based encoder-decoder models for abstractive summarization of bio...
New pyramidal hybrid textural and deep features based automatic skin cancer classification model: Ensemble DarkNet and textural feature extractor.
EN: Background: Skin cancer is one of the most widely seen cancers worldwide, and automatic classification of skin cancer can help dermatology clinics reach an accurate diagnosis. Hence, a machine learning-based automatic skin cancer detection model must be developed. Material and Method: This research aims to address the automatic skin cancer detection problem. A colored skin cancer image dataset is used. This dataset contains 3297 images with two classes. An automatic multilevel textural and deep features-based model is presented. Multilevel fused feature generation using discrete wavelet transform (DWT), local phase quantization (LPQ), local binary pattern (LBP), pre-trained DarkNet19, and DarkNet53 is utilized to generate features of the skin cancer images, and the top 1000 features are selected by threshold value-based neighborhood component analysis (NCA). The chosen top 1000 features are classified using the 10-fold cross-validation technique. Results: Using ten-fold cross-validation, a classification accuracy of 91.54% is obtained with the recommended pyramidal hybrid feature generator and NCA selector-based model. Further, various training and testing sep...
Root-aligned SMILES: A Tight Representation for Chemical Reaction Prediction.
EN: Chemical reaction prediction, involving forward synthesis and retrosynthesis prediction, is a fundamental problem in organic synthesis. A popular computational paradigm formulates synthesis prediction as a sequence-to-sequence translation problem, where the typical SMILES is adopted for molecule representations. However, the general-purpose SMILES neglects the characteristics of chemical reactions, where the molecular graph topology is largely unaltered from reactants to products, resulting in the suboptimal performance of SMILES if straightforwardly applied. In this article, we propose the root-aligned SMILES (R-SMILES), which specifies a tightly aligned one-to-one mapping between the product and the reactant SMILES for more efficient synthesis prediction. Due to the strict one-to-one mapping and reduced edit distance, the computational model is largely relieved from learning the complex syntax and dedicated to learning the chemical knowledge for reactions. We compare the proposed R-SMILES with various state-of-the-art baselines and show that it significantly outperforms them all, demonstrating the superiority of the proposed method.
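The "reduced edit distance" argument can be illustrated with a character-level Levenshtein distance. The SMILES strings below are hand-written for an esterification (acetic acid plus ethanol giving ethyl acetate) and are an assumption for illustration: they were not produced by the R-SMILES procedure, which works from atom-mapped reactions, and the paper may measure distance over tokens rather than characters.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# Both strings start from the same atom (the acetyl methyl carbon).
product_aligned, reactants_aligned = "CC(=O)OCC", "CC(=O)O.OCC"
# Same molecules, written from unrelated starting atoms.
product_other, reactants_other = "CCOC(C)=O", "CC(=O)O.CCO"

print(levenshtein(product_aligned, reactants_aligned))  # 2
print(levenshtein(product_other, reactants_other))
```

In the aligned pair the reactant string is the product string plus two inserted characters, so a sequence-to-sequence model mostly has to copy; the unaligned pair needs more edits for the same chemistry.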
A 3D Generative Model for Structure-Based Drug Design.
EN: We study a fundamental problem in structure-based drug design -- generating molecules that bind to specific protein binding sites. While we have witnessed the great success of deep generative models in drug design, the existing methods are mostly string-based or graph-based. They are limited by the lack of spatial information and thus unable to be applied to structure-based design tasks. Particularly, such models have little or no knowledge of how molecules interact with their target proteins exactly in 3D space. In this paper, we propose a 3D generative model that generates molecules given a designated 3D protein binding site. Specifically, given a binding site as the 3D context, our model estimates the probability density of atom occurrences in 3D space -- positions that are more likely to have atoms will be assigned higher probability. To generate 3D molecules, we propose an auto-regressive sampling scheme -- atoms are sampled sequentially from the learned distribution until there is no room for new atoms. Combined with this sampling scheme, our model can generate valid and diverse molecules, which could be applicable to various structure-based molecular design tasks such as m...
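The auto-regressive sampling loop described here can be sketched in a few lines. Everything below is a toy stand-in under stated assumptions: the density is a hand-written Gaussian around a hypothetical pocket center rather than a learned network, candidate positions come from a small grid, atom types are ignored, and "no room" is approximated by a minimum-distance rule, which is not necessarily the stopping criterion used in the paper.

```python
import math
import random


def sample_atoms(density, candidates, min_dist=1.0, seed=0):
    """Place atoms one at a time; stop when no candidate position is free."""
    rng = random.Random(seed)
    placed = []
    while True:
        # A position is free if it is at least min_dist from every placed atom.
        free = [p for p in candidates
                if all(math.dist(p, q) >= min_dist for q in placed)]
        weights = [density(p, placed) for p in free]
        if not free or sum(weights) <= 0:
            return placed
        # Sample the next atom position in proportion to the density.
        placed.append(rng.choices(free, weights=weights, k=1)[0])


# Hypothetical pocket: a 3 x 3 x 3 grid with 0.8 spacing, density peaked at the center.
grid = [(0.8 * x, 0.8 * y, 0.8 * z)
        for x in range(3) for y in range(3) for z in range(3)]
center = (0.8, 0.8, 0.8)


def toy_density(p, placed):
    """Stand-in for a learned density; it ignores the atoms placed so far."""
    return math.exp(-math.dist(p, center) ** 2)


atoms = sample_atoms(toy_density, grid, min_dist=1.0)
print(len(atoms), atoms)
```

In a learned model the density would be re-evaluated after each placement, conditioned on both the binding site and the atoms already placed, which is why `density` receives `placed` as an argument even though the toy version ignores it.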
BIOS: An Algorithmically Generated Biomedical Knowledge Graph.
EN: Biomedical knowledge graphs (BioMedKGs) are essential infrastructures for biomedical and healthcare big data and artificial intelligence (AI), facilitating natural language processing, model development, and data exchange. For decades, these knowledge graphs have been developed via expert curation; however, this method can no longer keep up with today's AI development, and a transition to algorithmically generated BioMedKGs is necessary. In this work, we introduce the Biomedical Informatics Ontology System (BIOS), the first large-scale publicly available BioMedKG generated completely by machine learning algorithms. BIOS currently contains 4.1 million concepts, 7.4 million terms in two languages, and 7.3 million relation triplets. We present the methodology for developing BIOS, including the curation of raw biomedical terms, computational identification of synonymous terms and aggregation of these terms to create concept nodes, semantic type classification of the concepts, relation identification, and biomedical machine translation. We provide statistics on the current BIOS content and perform preliminary assessments of term quality, synonym grouping, and relation extraction. The re...
Disparities in Dermatology AI Performance on a Diverse, Curated Clinical Image Set.
EN: Access to dermatological care is a major issue, with an estimated 3 billion people lacking access to care globally. Artificial intelligence (AI) may aid in triaging skin diseases. However, most AI models have not been rigorously assessed on images of diverse skin tones or uncommon diseases. To ascertain potential biases in algorithm performance in this context, we curated the Diverse Dermatology Images (DDI) dataset, the first publicly available, expertly curated, and pathologically confirmed image dataset with diverse skin tones. Using this dataset of 656 images, we show that state-of-the-art dermatology AI models perform substantially worse on DDI, with receiver operator curve area under the curve (ROC-AUC) dropping by 27-36 percent compared to the models' original test results. All the models performed worse on dark skin tones and uncommon diseases, which are represented in the DDI dataset. Additionally, we find that dermatologists, who typically provide visual labels for AI training and test datasets, also perform worse on images of dark skin tones and uncommon diseases compared to ground truth biopsy annotations. Finally, fine-tuning AI models on the well-characterized and dive...
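For readers unfamiliar with the metric, ROC-AUC can be computed directly as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a generic pairwise implementation with made-up labels and scores; it is not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive scoring above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = malignant, 0 = benign.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a scale for reading the reported drop.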
Automated clustering of COVID-19 anti-vaccine discourse on Twitter.
EN: Attitudes about vaccination have become more polarized; it is common to see vaccine disinformation and fringe conspiracy theories online. An observational study of Twitter vaccine discourse is found in Ojea Quintana et al. (2021): the authors analyzed approximately six months of Twitter discourse -- 1.3 million original tweets and 18 million retweets between December 2019 and June 2020, ranging from before to after the establishment of Covid-19 as a pandemic. This work expands upon Ojea Quintana et al. (2021) with two main contributions from data science. First, based on the authors' initial network clustering and qualitative analysis techniques, we are able to clearly demarcate and visualize the language patterns used in discourse by Antivaxxers (anti-vaccination campaigners and vaccine deniers) versus other clusters (collectively, Others). Second, using the characteristics of Antivaxxers' tweets, we develop text classifiers to determine the likelihood a given user is employing anti-vaccination language, ultimately contributing to an early-warning mechanism to improve the health of our epistemic environment and bolster (and not hinder) public health initiatives.
BioADAPT-MRC: Adversarial Learning-based Domain Adaptation Improves Biomedical Machine Reading Comprehension Task.
EN: Biomedical machine reading comprehension (biomedical-MRC) aims to comprehend complex biomedical narratives and assist healthcare professionals in retrieving information from them. The high performance of modern neural network-based MRC systems depends on high-quality, large-scale, human-annotated training datasets. In the biomedical domain, a crucial challenge in creating such datasets is the requirement for domain knowledge, leading to a scarcity of labeled data and the need for transfer learning from the labeled general-purpose (source) domain to the biomedical (target) domain. However, there is a discrepancy in marginal distributions between the general-purpose and biomedical domains due to the variances in topics. Therefore, directly transferring learned representations from a model trained on a general-purpose domain to the biomedical domain can hurt the model's performance. We present an adversarial learning-based domain adaptation framework for the biomedical machine reading comprehension task (BioADAPT-MRC), a neural network-based method to address the discrepancies in the marginal distributions between the general and biomedical domain datasets. BioADAPT-MRC relaxes the n...
DermX: an end-to-end framework for explainable automated dermatological diagnosis.
EN: Dermatological diagnosis automation is essential in addressing the high prevalence of skin diseases and critical shortage of dermatologists. Despite approaching expert-level diagnosis performance, convolutional neural network (ConvNet) adoption in clinical practice is impeded by their limited explainability, and by subjective, expensive explainability validations. We introduce DermX and DermX+, an end-to-end framework for explainable automated dermatological diagnosis. DermX is a clinically-inspired explainable dermatological diagnosis ConvNet, trained using DermXDB, a 554 image dataset annotated by eight dermatologists with diagnoses, supporting explanations, and explanation attention maps. DermX+ extends DermX with guided attention training for explanation attention maps. Both methods achieve near-expert diagnosis performance, with DermX, DermX+, and dermatologist F1 scores of 0.79, 0.79, and 0.87, respectively. We assess the explanation performance in terms of identification and localization by comparing model-selected with dermatologist-selected explanations, and gradient-weighted class-activation maps with dermatologist explanation maps, respectively. DermX obtained an identif...
Federated Contrastive Learning for Dermatological Disease Diagnosis via On-device Learning.
EN: Deep learning models have been deployed in an increasing number of edge and mobile devices to provide healthcare. These models rely on training with a tremendous amount of labeled data to achieve high accuracy. However, for medical applications such as dermatological disease diagnosis, the private data collected by mobile dermatology assistants exist on distributed mobile devices of patients, and each device only has a limited amount of data. Directly learning from limited data greatly deteriorates the performance of learned models. Federated learning (FL) can train models by using data distributed on devices while keeping the data local for privacy. Existing works on FL assume all the data have ground-truth labels. However, medical data often comes without any accompanying labels since labeling requires expertise and results in prohibitively high labor costs. The recently developed self-supervised learning approach, contrastive learning (CL), can leverage the unlabeled data to pre-train a model, after which the model is fine-tuned on limited labeled data for dermatological disease diagnosis. However, simply combining CL with FL as federated contrastive learning (FCL) will result i...
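The federated learning idea in this abstract, training on distributed devices while the data stays local, can be sketched as a server that only ever sees model parameters. The abstract does not name an aggregation rule, so the example assumes plain federated averaging (FedAvg): each client's parameters are weighted by its number of local samples. The function name and the numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg-style)."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one size per client and at least one client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]


# Two hypothetical devices: only parameters and sample counts leave the device.
device_params = [[1.0, 2.0], [3.0, 4.0]]
device_samples = [1, 3]

print(federated_average(device_params, device_samples))  # [2.5, 3.5]
```

In a contrastive pre-training setting the averaged vector would be the encoder's parameters after a round of local self-supervised updates; the raw images never appear in this exchange.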
SuperCon: Supervised Contrastive Learning for Imbalanced Skin Lesion Classification.
EN: Convolutional neural networks (CNNs) have achieved great success in skin lesion classification. A balanced dataset is required to train a good model. However, because different skin lesions occur at different rates in practice, severe or even the deadliest skin lesion types (e.g., melanoma) are naturally represented by only a small number of samples in a dataset. Since this widely degrades classification performance, it is important to have CNNs that work well on class-imbalanced skin lesion image datasets. In this paper, we propose SuperCon, a two-stage training strategy to overcome the class imbalance problem in skin lesion classification. It contains two stages: (i) representation training, which tries to learn a feature representation that is closely aligned within classes and far apart between classes, and (ii) classifier fine-tuning, which aims to learn a classifier that correctly predicts the label based on the learnt representations. In the experimental evaluation, extensive comparisons have been made between our approach and other existing approaches on skin lesion benchmark datasets. The results show that our two-stage training strategy effectively addresses the class imba...
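The representation-training stage, pulling same-class features together and pushing different classes apart, is the goal of a supervised contrastive loss. The abstract does not give the exact objective, so the sketch below assumes the commonly used supervised contrastive form: for each anchor, the mean negative log-probability of its same-class samples under a temperature-scaled softmax over all other samples. The embeddings and the temperature are made-up values.

```python
import math


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalized embeddings (pure Python)."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-class partner contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Two tight clusters; the loss is low when labels match the clusters
# and high when the labels cut across them.
points = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
print(supcon_loss(points, [0, 0, 1, 1]))
print(supcon_loss(points, [0, 1, 0, 1]))
```

In a two-stage setup like the one described, a loss of this kind would train the feature extractor first, and an ordinary classifier would then be fitted on the resulting features.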
Surface astrochemistry: a computational chemistry perspective.
EN: Molecules in space are synthesized via a large variety of gas-phase reactions, and reactions on dust-grain surfaces, where the surface acts as a catalyst. Saturated, hydrogen-rich molecules in particular are formed through surface chemistry. Astrochemical models have developed over the decades to understand the molecular processes in the interstellar medium, taking into account grain surface chemistry. However, essential input information for gas-grain models, such as binding energies of molecules to the surface, has been derived experimentally only for a handful of species, leaving hundreds of species with highly uncertain estimates. Moreover, some fundamental processes are not well enough constrained to implement these into the models. These proceedings give three examples of how computational chemistry techniques can help answer fundamental questions regarding grain surface chemistry.
Personalized Public Policy Analysis in Social Sciences using Causal-Graphical Normalizing Flows.
EN: Structural Equation/Causal Models (SEMs/SCMs) are widely used in epidemiology and social sciences to identify and analyze the average causal effect (ACE) and conditional ACE (CACE). Traditional causal effect estimation methods such as Inverse Probability Weighting (IPW) and more recently Regression-With-Residuals (RWR) are widely used - as they avoid the challenging task of identifying the SCM parameters - to estimate ACE and CACE. However, much work remains before traditional estimation methods can be used for counterfactual inference, and for the benefit of Personalized Public Policy Analysis (P$^3$A) in the social sciences. While doctors rely on personalized medicine to tailor treatments to patients in laboratory settings (relatively closed systems), P$^3$A draws inspiration from such tailoring but adapts it for open social systems. In this article, we develop a method for counterfactual inference that we name causal-Graphical Normalizing Flow (c-GNF), facilitating P$^3$A. First, we show how c-GNF captures the underlying SCM without making any assumption about functional forms. Second, we propose a novel dequantization trick to deal with discrete variables, which is a limitation...
Optimal vaccination at high reproductive numbers: sharp transitions and counter-intuitive allocations.
EN: Optimization of vaccine allocations among different segments of a heterogeneous population is important for enhancing the effectiveness of vaccination campaigns in reducing the burden of epidemics. Intuitively, it would seem that allocations designed to minimize infections should prioritize those with the highest risk of being infected and infecting others. This prescription is well supported by vaccination theory, e.g., when the vaccination campaign aims to reach herd immunity. In this work, we show, however, that for vaccines providing partial protection (leaky vaccines) and for sufficiently high values of the basic reproduction number, intuition is overturned: the optimal allocation for minimizing the number of infections prioritizes the vaccination of those who are least likely to be infected. Furthermore, we show that this phenomenon occurs over a range of basic reproduction numbers relevant for the currently circulating strains of SARS-CoV-2. The work combines numerical investigations, asymptotic analysis for a general model, and complete mathematical analysis in a simple two-group model. The results point to important considerations in managing vaccination campaigns for infec...
Transformers and the representation of biomedical background knowledge.
EN: Specialised transformer-based models (such as BioBERT and BioMegatron) are adapted for the biomedical domain based on publicly available biomedical corpora. As such, they have the potential to encode large-scale biological knowledge. We investigate the encoding and representation of biological knowledge in these models, and its potential utility to support inference in cancer precision medicine - namely, the interpretation of the clinical significance of genomic alterations. We compare the performance of different transformer baselines; we use probing to determine the consistency of encodings for distinct entities; and we use clustering methods to compare and contrast the internal properties of the embeddings for genes, variants, drugs and diseases. We show that these models do indeed encode biological knowledge, although some of this is lost in fine-tuning for specific tasks. Finally, we analyse how the models behave with regard to biases and imbalances in the dataset.
The Impact of Vaccination on the Infection rate and the Severity of Covid-19.
EN: This study aims to statistically assess the effectiveness of vaccination against SARS-CoV-2. Investigating the relationship between Covid-19 mortality and vaccination is essential to understanding the real-world impact of vaccines. We studied rates of infection and death due to Covid-19 in different countries with respect to their levels of vaccination. People who received the required doses of vaccine were considered fully vaccinated in this study. Based on the percentage of the population that was fully vaccinated, countries were categorized into several groups. Though a high-level study of vaccine effectiveness may not provide much insight into individual-level differences, a global analysis is imperative to infer the influence of vaccination as a measure for controlling the pandemic.
Classification of Skin Cancer Images using Convolutional Neural Networks.
EN: Skin cancer is the most common human malignancy (American Cancer Society). It is primarily diagnosed visually, starting with an initial clinical screening, potentially followed by dermoscopic (related to skin) analysis, a biopsy, and histopathological examination. Skin cancer occurs when errors (mutations) occur in the DNA of skin cells. The mutations cause the cells to grow out of control and form a mass of cancer cells. The aim of this study was to classify images of skin lesions with the help of convolutional neural networks. Deep neural networks show great potential for image classification while taking into account the large variability exhibited by the environment. Here we trained models on the pixel values of the images and classified them on the basis of disease labels. The dataset was acquired from an open-source Kaggle repository (Kaggle Dataset), which was itself derived from the ISIC (International Skin Imaging Collaboration) Archive. Training was performed on multiple models together with transfer learning. The highest model accuracy achieved was over 86.65%. The dataset used is publicly available to ensure credibility and reproducibility of the aforementioned...
Interconnect Parasitics and Partitioning in Fully-Analog In-Memory Computing Architectures.
EN: Fully-analog in-memory computing (IMC) architectures that implement both matrix-vector multiplication and non-linear vector operations within the same memory array have shown promising performance benefits over conventional IMC systems due to the removal of energy-hungry signal conversion units. However, maintaining the computation in the analog domain for the entire deep neural network (DNN) comes with potential sensitivity to interconnect parasitics. Thus, in this paper, we investigate the effect of wire parasitic resistance and capacitance on the accuracy of DNN models deployed on fully-analog IMC architectures. Moreover, we propose a partitioning mechanism that alleviates the impact of the parasitics while keeping the computation in the analog domain by dividing large arrays into multiple partitions. The SPICE circuit simulation results for a 400 X 120 X 84 X 10 DNN model deployed on a fully-analog IMC circuit show that a 94.84% accuracy could be achieved for an MNIST classification application with 16, 8, and 8 horizontal partitions, as well as 8, 8, and 1 vertical partitions for the first, second, and third layers of the DNN, respectively, which is comparable to the ~97% accuracy r...
Reducing COVID-19 Cases and Deaths by Applying Blockchain in Vaccination Rollout Management.
EN: Because a fast vaccination rollout against coronavirus disease 2019 (COVID-19) is critical to restore daily life and avoid virus mutations, it is tempting to have a relaxed vaccination-administration management system. However, a robust management system can support the enforcement of preventive measures and, in turn, reduce incidence and deaths. Here, we model a trustworthy and reliable blockchain-based management system for vaccine distribution by extending the Susceptible-Exposed-Infected-Recovered (SEIR) model. The model includes prevention measures such as mask-wearing, social distancing, vaccination rate, and vaccination efficiency. It also considers negative social behavior, such as violations of social distancing and attempts to use illegitimate vaccination proofs. By evaluating the model, we show that the proposed system can prevent up to 2.5 million cases and half a million deaths in the most demanding scenarios.
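For readers unfamiliar with SEIR-type models, the following is a minimal sketch of an SEIR system with a vaccination term and a contact-reduction factor, integrated with a simple Euler step. It does not reproduce the paper's extended model or its blockchain component; every parameter name and default value here (`beta`, `sigma`, `gamma`, `vacc_rate`, `vacc_eff`, `contact_reduction`) is a hypothetical choice for illustration.

```python
def simulate_seir(beta=0.3, sigma=1 / 5, gamma=1 / 10,
                  vacc_rate=0.005, vacc_eff=0.9, contact_reduction=0.0,
                  days=300, dt=0.1, n=1_000_000, e0=10):
    """Euler-integrated SEIR with vaccination (illustrative assumptions only).

    beta: transmission rate per day; sigma: 1 / latent period;
    gamma: 1 / infectious period; vacc_rate: fraction of susceptibles
    vaccinated per day; vacc_eff: fraction of vaccinations that protect;
    contact_reduction: stand-in for masks and distancing, in [0, 1].
    """
    s, e, i, r = float(n - e0), float(e0), 0.0, 0.0
    cumulative_infections = 0.0
    beta_eff = beta * (1.0 - contact_reduction)
    for _ in range(int(days / dt)):
        new_exposed = beta_eff * s * i / n       # S -> E
        new_infectious = sigma * e               # E -> I
        new_recovered = gamma * i                # I -> R
        protected = vacc_rate * vacc_eff * s     # S -> R via vaccination
        s += dt * (-new_exposed - protected)
        e += dt * (new_exposed - new_infectious)
        i += dt * (new_infectious - new_recovered)
        r += dt * (new_recovered + protected)
        cumulative_infections += dt * new_exposed
    return s, e, i, r, cumulative_infections

# Compare cumulative infections with and without vaccination.
baseline = simulate_seir(vacc_rate=0.0)
with_vaccination = simulate_seir(vacc_rate=0.02)
```

Comparing the last element of each returned tuple shows how a faster effective rollout lowers cumulative infections in this toy setting; the sizes of such effects in the paper come from its own extended model, not from this sketch.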
COVID-19 forecasting using new viral variants and vaccination effectiveness models.
EN: Background: Recently, a high number of daily positive COVID-19 cases have been reported in regions with relatively high vaccination rates; hence, booster vaccination has become necessary. In addition, infections caused by the different variants and correlated factors have not been discussed in depth. With large variabilities and different co-factors, it is difficult to use conventional mathematical models to forecast the incidence of COVID-19. Methods: Machine learning based on long short-term memory was applied to forecasting the time series of new daily positive cases (DPC), serious cases, hospitalized cases, and deaths. Data acquired from regions with high rates of vaccination, such as Israel, were blended with the current data of other regions in Japan to factor in the potential effects of vaccination. The protection provided by symptomatic infection was also considered in terms of the population effectiveness of vaccination as well as the waning protection and ratio and infectivity of viral variants. To represent changes in public behavior, public mobility and interactions through social media were also included in the analysis. Findings: Comparing the observed and estimat...
A Survey on Training Challenges in Generative Adversarial Networks for Biomedical Image Analysis.
EN: In biomedical image analysis, the applicability of deep learning methods is directly impacted by the quantity of image data available, because deep learning models require large image datasets to provide high-level performance. Generative Adversarial Networks (GANs) have been widely utilized to address data limitations through the generation of synthetic biomedical images. GANs consist of two models: the generator, which learns how to produce synthetic images based on the feedback it receives, and the discriminator, which classifies an image as synthetic or real and provides feedback to the generator. Throughout the training process, a GAN can experience several technical challenges that impede the generation of suitable synthetic imagery. The first is the mode collapse problem, whereby the generator either produces an identical image or produces a uniform image from distinct input features. The second is the non-convergence problem, whereby the gradient descent optimizer fails to reach a Nash equilibrium. The third is the vanishing gradient problem, whereby unstable training behavior occurs due to the discriminator achieving optimal classification performance resulting in no meani...
Physical mechanisms for droplet size and effective viscosity asymmetries in turbulent emulsions.
EN: By varying the oil volume fraction, the microscopic droplet size and the macroscopic rheology of emulsions are investigated in a Taylor-Couette (TC) turbulent shear flow. Although here oil and water in the emulsions have almost the same physical properties (density and viscosity), unexpectedly, we find that oil-in-water (O/W) and water-in-oil (W/O) emulsions have very distinct hydrodynamic behaviors, i.e., the system is clearly asymmetric. At the micro-scales, the average droplet diameter hardly changes with the oil volume fraction for either O/W or W/O. However, for W/O it is about 50% larger than that of O/W. At the macro-scales, the effective viscosity of O/W is higher than that of W/O. These asymmetric behaviors can be traced back to the presence of surface-active contaminants in the system. By introducing an oil-soluble surfactant at high concentration, remarkably, we recover the symmetry (droplet size and effective viscosity) between O/W and W/O emulsions. Based on this, we suggest a possible mechanism responsible for the initial asymmetry. Next, we discuss what sets the droplet size in turbulent emulsions. We uncover a -6/5 scaling dependence of ...
Sectioning of Biomedical Abstracts: A Sequence of Sequence Classification Task.
EN: Rapid growth of the biomedical literature has led to many advances in the biomedical text mining field. Among the vast amount of information, biomedical article abstracts are easily accessible sources. However, the number of structured abstracts, which label their rhetorical sections with one of the Background, Objective, Method, Result and Conclusion categories, is still limited. Exploration of valuable information in biomedical abstracts can be expedited by improvements in the sequential sentence classification task. Deep learning based models have great potential for achieving strong results in this task. However, they can often be overly complex and overfit to specific data. In this project, we study a state-of-the-art deep learning model, which we call the SSN-4 model here. We investigate different components of the SSN-4 model to study the trade-off between performance and complexity. We explore how well this model generalizes to a new dataset beyond the Randomized Controlled Trials (RCT) dataset. We address the question of whether word embeddings can be adjusted to the task to improve performance. Furthermore, we develop a second model ...
How Smart Should a Forager Be?
EN: We introduce an idealized model of an intelligent forager in which higher intelligence corresponds to a larger spatial range over which the forager can detect food. Such a forager diffuses randomly whenever the nearest food is more distant than the forager's detection range, $R$, and moves ballistically towards the nearest food inside its detection range. Concomitantly, the forager's metabolic energy cost per step is an increasing function of its intelligence. A dumb forager wanders randomly and may miss nearby food, thus making it susceptible to starvation. Conversely, a too-smart forager incurs a large metabolic cost per step during its search for food and is again susceptible to starvation. We show that the forager's lifetime is maximized at an optimal, intermediate level of intelligence.
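The trade-off described above can be explored with a small lattice simulation. This is a rough sketch, not the authors' model: the periodic square lattice, Manhattan-distance detection, non-replenishing food, the linear metabolic cost `base_cost + cost_per_range * detect_range`, and all default values are assumptions made for illustration.

```python
import random

def forager_lifetime(detect_range, n_food=200, size=50, start_energy=30.0,
                     food_energy=30.0, base_cost=1.0, cost_per_range=0.2,
                     max_steps=10_000, seed=0):
    """Steps survived by one forager on a size x size periodic lattice."""
    rng = random.Random(seed)
    food = set()
    while len(food) < n_food:
        food.add((rng.randrange(size), rng.randrange(size)))
    x, y = size // 2, size // 2
    food.discard((x, y))  # no food on the starting site
    energy = start_energy
    # Assumed: cost per step grows linearly with detection range.
    step_cost = base_cost + cost_per_range * detect_range
    for step in range(max_steps):
        if energy <= 0:
            return step  # starved
        # Look for the nearest food within the detection range.
        target, best = None, float("inf")
        for fx, fy in food:
            dx = (fx - x + size // 2) % size - size // 2
            dy = (fy - y + size // 2) % size - size // 2
            d = abs(dx) + abs(dy)
            if d <= detect_range and d < best:
                best, target = d, (dx, dy)
        if target is None:
            # Nothing detected: take one random-walk step.
            mx, my = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        elif target[0] != 0:
            # Food detected: step straight towards it, x axis first.
            mx, my = (1 if target[0] > 0 else -1), 0
        else:
            mx, my = 0, (1 if target[1] > 0 else -1)
        x, y = (x + mx) % size, (y + my) % size
        energy -= step_cost
        if (x, y) in food:
            food.remove((x, y))
            energy += food_energy
    return max_steps

# Mean lifetime over a few seeds for several detection ranges.
mean_lifetime = {
    R: sum(forager_lifetime(R, seed=s) for s in range(5)) / 5
    for R in (0, 2, 5, 20)
}
```

Printing `mean_lifetime` gives a quick feel for how survival depends on the detection range under these assumptions; whether and where an interior maximum appears depends on the chosen cost function and food density.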
Detection of Increased Time Intervals of Anti-Vaccine Tweets for COVID-19 Vaccine with BERT Model.
EN: The most effective solution against Covid-19 is the various vaccines that have been developed. Distrust of vaccines can hinder the rapid and effective use of this remedy. Social media is one of the means by which society expresses its thoughts. Determining the time intervals during which anti-vaccination sentiment increases on social media can help institutions determine the strategy to be used in combating it. Recording and tracking every tweet by human labor would be inefficient, so automated solutions are needed. In this study, the Bidirectional Encoder Representations from Transformers (BERT) model, a deep learning-based natural language processing (NLP) model, was used. On a dataset of 1506 tweets divided into four categories (news, irrelevant, anti-vaccine, and vaccine supporters), the model was trained with a learning rate of 5e-6 for 25 epochs. To determine the intervals in which anti-vaccine tweets are concentrated, the trained model was used to categorize 652840 tweets. The change in the assigned categories over time was visualized and the events that could have caused the change were identified. As a result of...
Dynamics of polydisperse multiple emulsions in microfluidic channels.
EN: Multiple emulsions are a class of soft fluid in which small drops are immersed within a larger one and stabilized over long periods of time by a surfactant. We recently showed that, if a monodisperse multiple emulsion is subject to a pressure-driven flow, a wide variety of nonequilibrium steady states emerges at late times, whose dynamics relies on a complex interplay between hydrodynamic interactions and multibody collisions among internal drops. In this work, we use lattice Boltzmann simulations to study the dynamics of polydisperse double emulsions driven by a Poiseuille flow within a microfluidic channel. Our results show that their behavior is critically affected by multiple factors, such as initial position, polydispersity index, and area fraction occupied within the emulsion. While at low area fraction inner drops may exhibit either a periodic rotational motion (at low polydispersity) or arrange into nonmotile configurations (at high polydispersity) located far from each other, at larger values of area fraction they remain in tight contact and move unidirectionally. This decisively conditions their close-range dynamics, quantitatively assessed through a time-efficiency-like ...
Agricultural Plant Cataloging and Establishment of a Data Framework from UAV-based Crop Images by Computer Vision.
EN: UAV-based image retrieval in modern agriculture enables gathering large amounts of spatially referenced crop image data. In large-scale experiments, however, UAV images contain a very large number of crops in a complex canopy architecture. Especially for the observation of temporal effects, this tremendously complicates the recognition of individual plants over several images and the extraction of relevant information. In this work, we present a hands-on workflow for the automated temporal and spatial identification and individualization of crop images from UAVs, abbreviated as "cataloging", based on comprehensible computer vision methods. We evaluate the workflow on two real-world datasets. One dataset was recorded for the observation of Cercospora leaf spot - a fungal disease - in sugar beet over an entire growing cycle. The other deals with harvest prediction of cauliflower plants. The plant catalog is used to extract single-plant images seen over multiple time points. This yields a large-scale spatio-temporal image dataset that in turn can be used to train further machine learning models including various data layers. The presented approach imp...
Applying Machine Learning and AI Explanations to Analyze Vaccine Hesitancy.
EN: The paper quantifies the impact of race, poverty, politics, and age on COVID-19 vaccination rates in counties in the continental US. Both OLS regression analysis and Random Forest machine learning algorithms are applied to quantify factors for county-level vaccination hesitancy. The machine learning model considers joint effects of variables (race/ethnicity, partisanship, age, etc.) simultaneously to capture the unique combination of these factors on the vaccination rate. By implementing a state-of-the-art Artificial Intelligence Explanations (AIX) algorithm, it is possible to solve the black box problem with machine learning models and provide answers to the "how much" question for each measured impact factor in every county. For most counties, a higher percentage vote for Republicans, a greater African American population share, and a higher poverty rate lower the vaccination rate, while a higher Asian population share increases the predicted vaccination rate. The impact on the vaccination rate from the Hispanic population proportion is positive in the OLS model, but only positive for counties with a high Hispanic population (>65%) in the Random Forest model. Both the proportion...
Meta-analysis of commercial-scale trials as a means to improve decision-making processes in the poultry industry: a phytogenic feed additive case study.
EN: Background and Objective: In the current study, we sought to determine the value of a meta-analysis to improve decision-making processes related to nutrition in the poultry industry. To this end, nine commercial size experiments were conducted to test the effect of a phytogenic feed additive and three approaches were applied to the data. Materials and Methods: In all experiments, 1-day-old male Cobb 500 chicks were used and fed corn-soybean meal diets. Two dietary treatments were tested: T1, control diet and T2, control diet + feed additive at a 0.05% inclusion rate. The experimental units were broiler houses (7 experiments), floor pens (1 experiment) and cages (1 experiment). The response variables were final body weight, feed intake, feed conversion ratio, mortality and production efficiency. Analyses of variance of data from each and all the experiments were performed using SAS under completely randomized non-blocked or blocked designs, respectively. The meta-analyses were performed in R programming language. Results: No statistically significant effects were found in the evaluated variables in any of the independent experiments (p>0.12), nor following the application of a block...
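As background on how a meta-analysis combines several trials, here is a minimal sketch of fixed-effect inverse-variance pooling, one common approach. The excerpt does not say which meta-analytic model the authors fitted in R, so this is not their method; the function name and the example effect sizes and standard errors are made-up placeholders, not data from the study.

```python
import math

def fixed_effect_pool(effects, std_errors):
    """Inverse-variance weighted pooled effect with a 95% confidence interval."""
    # More precise experiments (smaller standard error) get larger weights.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, ci

# Placeholder per-experiment treatment effects and standard errors.
example_effects = [0.8, 1.5, 0.4, 1.1]
example_ses = [0.9, 1.2, 0.7, 1.0]
pooled_effect, pooled_se, interval = fixed_effect_pool(example_effects, example_ses)
```

Because the pooled standard error shrinks as experiments are added, the pooled estimate is more precise than any single trial's estimate.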
ExAID: A Multimodal Explanation Framework for Computer-Aided Diagnosis of Skin Lesions.
EN: One principal impediment in the successful deployment of AI-based Computer-Aided Diagnosis (CAD) systems in clinical workflows is their lack of transparent decision making. Although commonly used eXplainable AI methods provide some insight into opaque algorithms, such explanations are usually convoluted and not readily comprehensible except by highly trained experts. The explanation of decisions regarding the malignancy of skin lesions from dermoscopic images demands particular clarity, as the underlying medical problem definition is itself ambiguous. This work presents ExAID (Explainable AI for Dermatology), a novel framework for biomedical image analysis, providing multi-modal concept-based explanations consisting of easy-to-understand textual explanations supplemented by visual maps justifying the predictions. ExAID relies on Concept Activation Vectors to map human concepts to those learnt by arbitrary Deep Learning models in latent space, and Concept Localization Maps to highlight concepts in the input space. This identification of relevant concepts is then used to construct fine-grained textual explanations supplemented by concept-wise location information to provide comprehen...
Molecular Plasmon Hybridization in Olefin Chains.
EN: With the continuous emergence of molecular and cluster devices and systems, the relationship between the plasmonic properties of multiple interacting clusters or molecules and the properties of the original single cluster or molecule is becoming more and more important. Similar to plasmonic nanoparticle hybridization, hybridization also occurs between two molecules with plasmon excitation modes. Using linear-response time-dependent density functional theory (LR-TDDFT) and real-time propagation time-dependent density functional theory (RT-TDDFT), combined with the plasmonicity index (PI) and transition contribution map (TCM) methods, we identify the plasmon excitation modes in small olefin chain molecules with -OH and -NH2 groups and analyze the hybridization characteristics using charge transitions. The results show that a plasmon hybridization mechanism also exists for plasmons in molecules when two molecules are coupled together. The TCM analysis shows that the plasmon modes and hybridization result from the coexistence of collective and single-particle excitations. When extra charge is deposited in the molecules, as the electrons can move through the whole mol...
An Ensemble of Pre-trained Transformer Models For Imbalanced Multiclass Malware Classification.
EN: Classification of malware families is crucial for a comprehensive understanding of how they can infect devices, computers, or systems. Thus, malware identification enables security researchers and incident responders to take precautions against malware and accelerate mitigation. API call sequences made by malware are widely utilized features by machine and deep learning models for malware classification as these sequences represent the behavior of malware. However, traditional machine and deep learning models remain incapable of capturing sequence relationships between API calls. On the other hand, the transformer-based models process sequences as a whole and learn relationships between API calls due to multi-head attention mechanisms and positional embeddings. Our experiments demonstrate that the transformer model with one transformer block layer surpassed the widely used base architecture, LSTM. Moreover, BERT or CANINE, pre-trained transformer models, outperformed in classifying highly imbalanced malware families according to evaluation metrics, F1-score, and AUC score. Furthermore, the proposed bagging-based random transformer forest (RTF), an ensemble of BERT or CANINE, has re...
AI-Bind: Improving Binding Predictions for Novel Protein Targets and Ligands.
EN: Identifying novel drug-target interactions (DTI) is a critical and rate limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We first unveil the mechanisms responsible for this shortcoming, demonstrating how models rely on shortcuts that leverage the topology of the protein-ligand bipartite network, rather than learning the node features. Then, we introduce AI-Bind, a pipeline that combines network-based sampling strategies with unsupervised pre-training, allowing us to limit the annotation imbalance and improve binding predictions for novel proteins and ligands. We illustrate the value of AI-Bind by predicting drugs and natural compounds with binding affinity to SARS-CoV-2 viral proteins and the associated human proteins. We also validate these predictions via docking simulations and comparison with recent experimental evidence, and step up the process of interpreting machine learning prediction of protein-ligand binding by identifying potential active binding sites on the amino acid sequence. Overall, AI-Bind of...
Mapping industrial poultry operations at scale with deep learning and aerial imagery.
EN: Concentrated Animal Feeding Operations (CAFOs) pose serious risks to air, water, and public health, but have proven to be challenging to regulate. The U.S. Government Accountability Office notes that a basic challenge is the lack of comprehensive location information on CAFOs. We use the USDA's National Agricultural Imagery Program (NAIP) 1m/pixel aerial imagery to detect poultry CAFOs across the continental United States. We train convolutional neural network (CNN) models to identify individual poultry barns and apply the best performing model to over 42 TB of imagery to create the first national, open-source dataset of poultry CAFOs. We validate the model predictions against a held-out validation set of poultry CAFO facility locations from 10 hand-labeled counties in California and demonstrate that this approach has significant potential to fill gaps in environmental monitoring.
Fine-Tuning Large Neural Language Models for Biomedical Natural Language Processing.
EN: Motivation: A perennial challenge for biomedical researchers and clinical practitioners is to stay abreast of the rapid growth of publications and medical notes. Natural language processing (NLP) has emerged as a promising direction for taming information overload. In particular, large neural language models facilitate transfer learning by pretraining on unlabeled text, as exemplified by the successes of BERT models in various NLP applications. However, fine-tuning such models for an end task remains challenging, especially with small labeled datasets, which are common in biomedical NLP. Results: We conduct a systematic study on fine-tuning stability in biomedical NLP. We show that fine-tuning performance may be sensitive to pretraining settings, especially in low-resource domains. Large models have the potential to attain better performance, but increasing model size also exacerbates fine-tuning instability. We thus conduct a comprehensive exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications. Specifically, freezing lower layers is helpful for stand...
An Empirical Study on Relation Extraction in the Biomedical Domain.
EN: Relation extraction is a fundamental problem in natural language processing. Most existing models are defined for relation extraction in the general domain. However, their performance on specific domains (e.g., biomedicine) is still unclear. To fill this gap, this paper carries out an empirical study on relation extraction in biomedical research articles. Specifically, we consider both sentence-level and document-level relation extraction, and run a few state-of-the-art methods on several benchmark datasets. Our results show that (1) current document-level relation extraction methods have strong generalization ability; (2) existing methods require a large amount of labeled data for model fine-tuning in biomedicine. Our observations may inspire people in this field to develop more effective models for biomedical relation extraction.
Enhancing Counterfactual Classification via Self-Training.
EN: Unlike traditional supervised learning, in many settings only partial feedback is available. We may only observe outcomes for the chosen actions, but not the counterfactual outcomes associated with other alternatives. Such settings encompass a wide variety of applications including pricing, online marketing and precision medicine. A key challenge is that observational data are influenced by historical policies deployed in the system, yielding a biased data distribution. We approach this task as a domain adaptation problem and propose a self-training algorithm, which we refer to as Counterfactual Self-Training (CST), that imputes outcomes with categorical values for finite unseen actions in the observational data to simulate a randomized trial through pseudolabeling. CST iteratively imputes pseudolabels and retrains the model. In addition, we show that an input consistency loss can further improve CST performance, in line with recent theoretical analysis of pseudolabeling. We demonstrate the effectiveness of the proposed algorithms on both synthetic and real datasets.
Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians.
EN: Extra-label drug use in food animal medicine is authorized by the US Animal Medicinal Drug Use Clarification Act (AMDUCA), and estimated withdrawal intervals are based on published scientific pharmacokinetic data. Occasionally there is a paucity of scientific data on which to base a withdrawal interval or a large number of animals being treated, driving the need to test for drug residues. Rapid assay commercial farm-side tests are essential for monitoring drug residues in animal products to protect human health. Active ingredients, sensitivity, matrices, and species that have been evaluated for commercial rapid assay tests are typically reported on manufacturers' websites or in PDF documents that are available to consumers but may require a special access request. Additionally, this information is not always correlated with FDA-approved tolerances. Furthermore, parameter changes for these tests can be very challenging to regularly identify, especially those listed on websites or in documents that are not publicly available. Therefore, artificial intelligence plays a critical role in efficiently extracting the data and ensuring that the information is current. Extracting tables from PDF and HTML ...
Dimensionality Reduction of Longitudinal 'Omics Data using Modern Tensor Factorization.
EN: Precision medicine is a clinical approach for disease prevention, detection and treatment, which considers each individual's genetic background, environment and lifestyle. The development of this tailored avenue has been driven by the increased availability of omics methods, large cohorts of temporal samples, and their integration with clinical data. Despite this immense progress, existing computational methods for data analysis fail to provide appropriate solutions for such complex, high-dimensional, longitudinal data. In this work we have developed a new method termed TCAM, a dimensionality reduction technique for multi-way data, that overcomes major limitations when doing trajectory analysis of longitudinal omics data. Using real-world data, we show that TCAM outperforms traditional methods, as well as state-of-the-art tensor-based approaches for longitudinal microbiome data analysis. Moreover, we demonstrate the versatility of TCAM by applying it to several different omics datasets, and its applicability as a drop-in replacement within straightforward ML tasks.
Deep Molecular Representation Learning via Fusing Physical and Chemical Information.
EN: Molecular representation learning is the first yet vital step in combining deep learning and molecular science. To push the boundaries of molecular representation learning, we present PhysChem, a novel neural architecture that learns molecular representations via fusing physical and chemical information of molecules. PhysChem is composed of a physicist network (PhysNet) and a chemist network (ChemNet). PhysNet is a neural physical engine that learns molecular conformations through simulating molecular dynamics with parameterized forces; ChemNet implements geometry-aware deep message-passing to learn chemical / biomedical properties of molecules. The two networks specialize in their own tasks and cooperate by providing expertise to each other. By fusing physical and chemical information, PhysChem achieved state-of-the-art performance on MoleculeNet, a standard molecular machine learning benchmark. The effectiveness of PhysChem was further corroborated on cutting-edge datasets of SARS-CoV-2.
Disparities in Dermatology AI: Assessments Using Diverse Clinical Images.
EN: More than 3 billion people lack access to care for skin disease. AI diagnostic tools may aid in early skin cancer detection; however, most models have not been assessed on images of diverse skin tones or uncommon diseases. To address this, we curated the Diverse Dermatology Images (DDI) dataset, the first publicly available set of pathologically confirmed images featuring diverse skin tones. We show that state-of-the-art dermatology AI models perform substantially worse on DDI, with ROC-AUC dropping 29-40 percent compared to the models' original results. We find that dark skin tones and uncommon diseases, which are well represented in the DDI dataset, lead to performance drop-offs. Additionally, we show that state-of-the-art robust training methods cannot correct for these biases without diverse training data. Our findings identify important weaknesses and biases in dermatology AI that need to be addressed to ensure reliable application to diverse patients and across all diseases.
RapidRead: Global Deployment of State-of-the-art Radiology AI for a Large Veterinary Teleradiology Practice.
EN: This work describes the development and real-world deployment of a deep learning-based AI system for evaluating canine and feline radiographs across a broad range of findings and abnormalities. We describe a new semi-supervised learning approach that combines NLP-derived labels with self-supervised training leveraging more than 2.5 million x-ray images. Finally we describe the clinical deployment of the model including system architecture, real-time performance evaluation and data drift detection.
Structure-aware generation of drug-like molecules.
EN: Structure-based drug design involves finding ligand molecules that exhibit structural and chemical complementarity to protein pockets. Deep generative methods have shown promise in proposing novel molecules from scratch (de-novo design), avoiding exhaustive virtual screening of chemical space. Most generative de-novo models fail to incorporate detailed ligand-protein interactions and 3D pocket structures. We propose a novel supervised model that generates molecular graphs jointly with 3D pose in a discretised molecular space. Molecules are built atom-by-atom inside pockets, guided by structural information from crystallographic data. We evaluate our model using a docking benchmark and find that guided generation improves predicted binding affinities by 8% and drug-likeness scores by 10% over the baseline. Furthermore, our model proposes molecules with binding scores exceeding some known ligands, which could be useful in future wet-lab studies.
SPECTRe: Substructure Processing, Enumeration, and Comparison Tool Resource: An efficient tool to encode all substructures of molecules represented in SMILES.
EN: Functional groups and moieties are chemical descriptors of biomolecules that can be used to interpret their properties and functions, leading to the understanding of chemical or biological mechanisms. These chemical building blocks, or sub-structures, enable the identification of common molecular subgroups, assessing the structural similarities and critical interactions among a set of biological molecules with known activities, and designing novel compounds with similar chemical properties. Here, we introduce a Python-based tool, SPECTRe (Substructure Processing, Enumeration, and Comparison Tool Resource), designed to provide all substructures in a given molecular structure, regardless of the molecule size, employing efficient enumeration and generation of substructures represented in a human-readable SMILES format through the use of classical graph traversal (breadth-first and depth-first search) algorithms. We demonstrate the application of SPECTRe for a set of 10,375 molecules in the molecular weight range 27 to 350 Da (<=26 non-hydrogen atoms), spanning a wide array of structure-based chemical functionalities and chemical classes. We found that the substructure count as a measu...
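As a rough illustration of substructure enumeration by graph traversal, the sketch below grows connected atom subsets breadth-first over a toy molecular graph. It is not SPECTRe's implementation: the adjacency-dict input, the function name `enumerate_connected_substructures`, and the atom-index output (rather than SMILES strings) are all assumptions made to keep the example self-contained.

```python
from collections import deque


def enumerate_connected_substructures(adjacency):
    """Return every connected set of atoms in a molecular graph.

    `adjacency` maps an atom index to the set of its bonded neighbours.
    Hypothetical helper for illustration only; not SPECTRe's API.
    """
    # Start from every single atom, then grow subsets one bonded atom at a time.
    seen = {frozenset([atom]) for atom in adjacency}
    queue = deque(seen)
    while queue:
        subset = queue.popleft()
        # Atoms bonded to the current subset but not yet part of it.
        frontier = {n for atom in subset for n in adjacency[atom]} - subset
        for neighbour in frontier:
            grown = subset | {neighbour}
            if grown not in seen:
                seen.add(grown)
                queue.append(grown)
    return seen


# Heavy-atom skeleton of ethanol (C-C-O) as a three-node path graph.
ethanol = {0: {1}, 1: {0, 2}, 2: {1}}
substructures = enumerate_connected_substructures(ethanol)
```

The number of connected subsets can grow exponentially with molecule size, so a plain enumeration like this one is only practical for small graphs.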
Efficient Learning of Quadratic Variance Function Directed Acyclic Graphs via Topological Layers.
EN: Directed acyclic graph (DAG) models are widely used to represent causal relationships among random variables in many application domains. This paper studies a special class of non-Gaussian DAG models, where the conditional variance of each node given its parents is a quadratic function of its conditional mean. Such a class of non-Gaussian DAG models is fairly flexible and admits many popular distributions as special cases, including Poisson, Binomial, Geometric, Exponential, and Gamma. To facilitate learning, we introduce a novel concept of topological layers, and develop an efficient DAG learning algorithm. It first reconstructs the topological layers in a hierarchical fashion and then recovers the directed edges between nodes in different layers, which requires much less computational cost than most existing algorithms in the literature. Its advantage is also demonstrated in a number of simulated examples, as well as its applications to two real-life datasets, including an NBA player statistics dataset and a cosmetic sales dataset collected by Alibaba.
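The quadratic variance property can be written out explicitly. The form below, with no constant term, and the coefficient names $\beta_{j0}$, $\beta_{j1}$ are assumptions of this sketch and may differ from the paper's notation; the special cases are standard moment identities.

```latex
\operatorname{Var}\bigl(X_j \mid X_{\mathrm{pa}(j)}\bigr)
  = \beta_{j0}\,\mu_j + \beta_{j1}\,\mu_j^{2},
\qquad
\mu_j = \mathbb{E}\bigl[X_j \mid X_{\mathrm{pa}(j)}\bigr].

\begin{aligned}
\text{Poisson:}\quad & \operatorname{Var} = \mu
  && (\beta_{j0} = 1,\ \beta_{j1} = 0) \\
\text{Binomial ($N$ trials):}\quad & \operatorname{Var} = \mu - \mu^{2}/N
  && (\beta_{j0} = 1,\ \beta_{j1} = -1/N) \\
\text{Exponential:}\quad & \operatorname{Var} = \mu^{2}
  && (\beta_{j0} = 0,\ \beta_{j1} = 1) \\
\text{Gamma (shape $k$):}\quad & \operatorname{Var} = \mu^{2}/k
  && (\beta_{j0} = 0,\ \beta_{j1} = 1/k)
\end{aligned}
```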
Influential Prototypical Networks for Few Shot Learning: A Dermatological Case Study.
EN: Prototypical network (PN) is a simple yet effective few shot learning strategy. It is a metric-based meta-learning technique where classification is performed by computing Euclidean distances to prototypical representations of each class. Conventional PN attributes equal importance to all samples and generates prototypes by simply averaging the support sample embeddings belonging to each class. In this work, we propose a novel version of PN that attributes weights to support samples corresponding to their influence on the support sample distribution. Influence weights of samples are calculated based on maximum mean discrepancy (MMD) between the mean embeddings of sample distributions including and excluding the sample. Comprehensive evaluation of our proposed influential PN (IPNet) is performed by comparing its performance with other baseline PNs on three different benchmark dermatological datasets. IPNet outperforms all baseline models with compelling results across all three datasets and various N-way, K-shot classification tasks. Findings from cross-domain adaptation experiments further establish the robustness and generalizability of IPNet.
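The difference between averaged and weighted prototypes can be sketched in a few lines. This is a minimal illustration, not IPNet itself: the function names, the toy 2-D embeddings, and the weight values are hypothetical, and the weights are hand-picked placeholders rather than influence scores computed from MMD.

```python
import math


def prototype(embeddings, weights=None):
    """Weighted mean of support embeddings.

    With uniform weights this is the conventional PN prototype (a plain average).
    """
    if weights is None:
        weights = [1.0] * len(embeddings)
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum(w * e[d] for w, e in zip(weights, embeddings)) / total
        for d in range(dim)
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(query, prototypes):
    """Assign the query to the class whose prototype is nearest."""
    return min(prototypes, key=lambda label: euclidean(query, prototypes[label]))


# Toy support set; the last "benign" embedding plays the role of an outlier.
support = {
    "benign": [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]],
    "malignant": [[5.0, 5.0], [6.0, 5.0]],
}
uniform = {label: prototype(embs) for label, embs in support.items()}

# Placeholder weights that down-weight the outlier (assumed, not MMD-derived).
weighted = dict(uniform, benign=prototype(support["benign"], [1.0, 1.0, 0.1]))
prediction = classify([1.0, 1.0], weighted)
```

Down-weighting the outlier pulls the "benign" prototype back toward the bulk of its support samples, which is the effect that influence weighting is meant to achieve.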
Progressive observation of Covid-19 vaccination effects on skin-cellular structures by use of Intelligent Laser Speckle Classification (ILSC).
EN: We progressively observed the effect of Covid-19 AstraZeneca vaccination on the skin cellular network and its properties using the well-established Intelligent Laser Speckle Classification (ILSC) image-based technique, and were able to distinguish three subject groups (early-vaccinated, late-vaccinated and non-vaccinated individuals) from their laser speckle skin image samples. The results show that the ILSC technique, in association with the optimised Bayesian network, is capable of classifying skin changes of vaccinated and non-vaccinated individuals and also of detecting progressive changes in skin cellular properties over a one-month period.
A fast accurate fine-grain object detection model based on YOLOv4 deep neural network.
EN: Early identification and prevention of various plant diseases in commercial farms and orchards is a key feature of precision agriculture technology. This paper presents a high-performance real-time fine-grain object detection framework that addresses several obstacles in plant disease detection that hinder the performance of traditional methods, such as dense distribution, irregular morphology, multi-scale object classes, and textural similarity. The proposed model is built on an improved version of the You Only Look Once (YOLOv4) algorithm. The modified network architecture maximizes both detection accuracy and speed: DenseNet is included in the backbone to optimize feature transfer and reuse; two new residual blocks in the backbone and neck enhance feature extraction and reduce computing cost; Spatial Pyramid Pooling (SPP) enhances the receptive field; and a modified Path Aggregation Network (PANet) preserves fine-grain localized information and improves feature fusion. Additionally, the use of the Hard-Swish function as the primary activation improved the model's accuracy due to better nonlinear feature extraction. The proposed model is tested in detecting four different d...
DOCKSTRING: easy molecular docking yields better benchmarks for ligand design.
EN: The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate's interaction with the target. By contrast, molecular docking is a widely successful method in drug discovery to estimate binding affinities. However, docking simulations require a significant amount of domain knowledge to set up correctly which hampers adoption. To this end, we present DOCKSTRING, a bundle for meaningful and robust comparison of ML models consisting of three components: (1) an open-source Python package for straightforward computation of docking scores; (2) an extensive dataset of docking scores and poses of more than 260K ligands for 58 medically-relevant targets; and (3) a set of pharmaceutically-relevant benchmark tasks including regression, virtual screening, and de novo design. The Python package implements a robust ligand and target preparation protocol that allows non-experts to ob...
Generating 3D Molecules Conditional on Receptor Binding Sites with Deep Generative Models.
EN: The goal of structure-based drug discovery is to find small molecules that bind to a given target protein. Deep learning has been used to generate drug-like molecules with certain cheminformatic properties, but has not yet been applied to generating 3D molecules predicted to bind to proteins by sampling the conditional distribution of protein-ligand binding interactions. In this work, we describe for the first time a deep learning system for generating 3D molecular structures conditioned on a receptor binding site. We approach the problem using a conditional variational autoencoder trained on an atomic density grid representation of cross-docked protein-ligand structures. We apply atom fitting and bond inference procedures to construct valid molecular conformations from generated atomic densities. We evaluate the properties of the generated molecules and demonstrate that they change significantly when conditioned on mutated receptors. We also explore the latent space learned by our generative model using sampling and interpolation techniques. This work opens the door for end-to-end prediction of stable bioactive molecules from protein structures with deep learning.
Abstractified Multi-instance Learning (AMIL) for Biomedical Relation Extraction.
EN: Relation extraction in the biomedical domain is a challenging task due to a lack of labeled data and a long-tail distribution of fact triples. Many works leverage distant supervision which automatically generates labeled data by pairing a knowledge graph with raw textual data. Distant supervision produces noisy labels and requires additional techniques, such as multi-instance learning (MIL), to denoise the training signal. However, MIL requires multiple instances of data and struggles with very long-tail datasets such as those found in the biomedical domain. In this work, we propose a novel reformulation of MIL for biomedical relation extraction that abstractifies biomedical entities into their corresponding semantic types. By grouping entities by types, we are better able to take advantage of the benefits of MIL and further denoise the training signal. We show this reformulation, which we refer to as abstractified multi-instance learning (AMIL), improves performance in biomedical relationship extraction. We also propose a novel relationship embedding architecture that further improves model performance.
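The effect of abstracting entities into semantic types can be illustrated by comparing how mentions group into bags. This is a minimal sketch of the grouping idea only, not the AMIL model: the sentences, entity names, type labels, and the helper `build_bags` are invented for illustration.

```python
from collections import defaultdict


def build_bags(mentions, key):
    """Group sentence texts into multi-instance bags under the given key function."""
    bags = defaultdict(list)
    for mention in mentions:
        bags[key(mention)].append(mention["text"])
    return dict(bags)


# Toy distantly supervised mentions (all values are made up for this example).
mentions = [
    {"head": "aspirin", "tail": "headache", "head_type": "Drug",
     "tail_type": "Symptom", "text": "Aspirin relieved the headache."},
    {"head": "ibuprofen", "tail": "fever", "head_type": "Drug",
     "tail_type": "Symptom", "text": "Ibuprofen reduced fever in most patients."},
    {"head": "codeine", "tail": "cough", "head_type": "Drug",
     "tail_type": "Symptom", "text": "Codeine suppressed the cough."},
]

# Conventional MIL: one bag per entity pair, so rare pairs yield single-sentence bags.
entity_bags = build_bags(mentions, key=lambda m: (m["head"], m["tail"]))

# Abstracted grouping: one bag per semantic-type pair, so long-tail pairs share a bag.
type_bags = build_bags(mentions, key=lambda m: (m["head_type"], m["tail_type"]))
```

Here the three entity-level bags each hold a single sentence, while the type-level bag holds all three, which gives a multi-instance learner more instances per bag to denoise over.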
Fragmentation Statistics of Food Diced and Crushed Using a Food Mixer.
EN: The fragment-size distributions of raw carrot diced or crushed using a food mixer are studied experimentally. For the 5-mm-square raw carrot, the normal distribution shows a characteristic feature of food fragmentation statistics. This simple result indicates that most random errors contribute to fragment-size fluctuation. On the other hand, for the crushed raw carrot, the cumulative fragment size distribution follows the power law where the exponent $α\simeq 1.62 > 1$. Furthermore, considering the cumulative fragment-size distribution as a function of length for comparison with geomaterials, such as fault rocks, the exponent $D \simeq 3.64$. Previous studies have shown that the power-law distribution observed in sequential fragmentation tends to have a large exponent value. As our experiment is also based on sequential fragmentation, the obtained large values of exponents $α$ and $D$ are consistent with those obtained in previous studies on sequential fragmentation. On the basis of previous studies and our observations, we discuss the effect of the preferential fragmentation of particles as large as the mixer blades. We also discuss the existence of a lower limit beyond which furt...
Deep Learning Model of Dock by Dock Process Significantly Accelerate the Process of Docking-based Virtual Screening.
EN: Docking-based virtual screening (VS process) selects ligands with potential pharmacological activities from millions of molecules using computational docking methods, which could greatly reduce the number of compounds for experimental screening, shorten the research period and save the research cost. However, a majority of compounds with low docking scores could waste most of the computational resources. Herein, we report a novel and practical docking-based machine learning method called MLDDM (Machine Learning Docking-by-Docking Models). It is composed of a regression model and a classification model that simulate a classical docking-by-docking protocol usually applied in many virtual screening projects. MLDDM can quickly eliminate compounds with low docking scores, and the retained compounds with potentially high docking scores are then examined further with a real docking program. We demonstrated that MLDDM has a good ability to identify active compounds in the case studies for 10 specific protein targets. Compared to a pure docking-by-docking based VS protocol, the VS process with MLDDM can achieve a speedup of over 120 times on average and the consistency rate with correspo...
Vaccine skepticism detection by network embedding.
EN: We demonstrate the applicability of network embedding to vaccine skepticism, a controversial topic with a long history. With the Covid-19 pandemic outbreak at the end of 2019, the topic is more important than ever. Only a year after the first international cases were registered, multiple vaccines were developed and passed clinical testing. Besides the challenges of development, testing, and logistics, another factor that might play a significant role in the fight against the pandemic is people who are hesitant to get vaccinated, or even state that they will refuse any vaccine offered to them. Two groups of people are commonly referred to as (a) pro-vaxxers, those who support vaccinating people, and (b) vax-skeptics, those who question vaccine efficacy or the need for general vaccination against Covid-19. It is very difficult to tell exactly how many people share each of these views. It is even more difficult to understand all the reasons why vax-skeptic opinions are getting more popular. In this work, our intention was to develop techniques that are able to efficiently differentiate between pro-vaxxer and vax-skeptic content. After multiple data preprocessing steps, we analyzed the tweet te...
Using Clinical Drug Representations for Improving Mortality and Length of Stay Predictions.
EN: Drug representations have played an important role in cheminformatics. However, in the healthcare domain, drug representations have been underused relative to the rest of Electronic Health Record (EHR) data, due to the complexity of high-dimensional drug representations and the lack of a proper pipeline for converting clinical drugs to their representations. Time-varying vital signs, laboratory measurements, and related time-series signals are commonly used to predict clinical outcomes. In this work, we demonstrated that using clinical drug representations in addition to other clinical features has significant potential to increase the performance of mortality and length of stay (LOS) models. We evaluate two different drug representation methods (Extended-Connectivity Fingerprint-ECFP and SMILES-Transformer embedding) on clinical outcome predictions. The results have shown that the proposed multimodal approach achieves substantial enhancement on clinical tasks over baseline models. Using clinical drug representations as additional features improves the LOS prediction for Area Under the Receiver Operating Characteristics (AUROC) by around 6% and for Area Under Precision-Re...
Elastic Shape Analysis of Tree-like 3D Objects using Extended SRVF Representation.
EN: How can one analyze detailed 3D biological objects, such as neurons and botanical trees, that exhibit complex geometrical and topological variation? In this paper, we develop a novel mathematical framework for representing, comparing, and computing geodesic deformations between the shapes of such tree-like 3D objects. A hierarchical organization of subtrees characterizes these objects -- each subtree has the main branch with some side branches attached -- and one needs to match these structures across objects for meaningful comparisons. We propose a novel representation that extends the Square-Root Velocity Function (SRVF), initially developed for Euclidean curves, to tree-shaped 3D objects. We then define a new metric that quantifies the bending, stretching, and branch sliding needed to deform one tree-shaped object into the other. Compared to the current metrics, such as the Quotient Euclidean Distance (QED) and the Tree Edit Distance (TED), the proposed representation and metric capture the full elasticity of the branches (i.e., bending and stretching) as well as the topological variations (i.e., branch death/birth and sliding). It completely avoids the shrinkage that results fr...
Automated Feature-Specific Tree Species Identification from Natural Images using Deep Semi-Supervised Learning.
EN: Prior work on plant species classification predominantly focuses on building models from isolated plant attributes. Hence, there is a need for tools that can assist in species identification in the natural world. We present a novel and robust two-fold approach capable of identifying trees in a real-world natural setting. Further, we leverage unlabelled data through deep semi-supervised learning and demonstrate superior performance to supervised learning. Our single-GPU implementation for feature recognition uses minimal annotated data and achieves accuracies of 93.96% and 93.11% for leaves and bark, respectively. Further, we extract feature-specific datasets of 50 species by employing this technique. Finally, our semi-supervised species classification method attains 94.04% top-5 accuracy for leaves and 83.04% top-5 accuracy for bark.
Automated Aerial Animal Detection When Spatial Resolution Conditions Are Varied.
EN: Knowing where livestock are located enables optimized management and mustering. However, Australian farms are large, meaning that many of Australia's livestock are unmonitored, which impacts farm profit, animal welfare and the environment. Effective animal localisation and counting by analysing satellite imagery overcomes this management hurdle; however, high-resolution satellite imagery is expensive. Thus, to minimise cost, the lowest spatial resolution data that enables accurate livestock detection should be selected. In our work, we determine the association between object detector performance and spatial degradation for cattle, sheep and dogs. Accurate ground truth was established using high-resolution drone images, which were then downsampled to various ground sample distances (GSDs). Both circular and Cassegrain aperture optics were simulated to generate point spread functions (PSFs) corresponding to various optical qualities. By simulating the PSF, rather than approximating it as a Gaussian, the images were accurately degraded to match the spatial resolution and blurring structure of satellite imagery. Two existing datasets were combined and used to train and test a YoloV5 obje...
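The relationship between downsampling and ground sample distance can be shown with a deliberately simplified sketch. It uses plain block averaging, which is not the PSF-based degradation the abstract describes; the function name, the toy image, and the 0.05 m source GSD are hypothetical values chosen for illustration.

```python
def downsample_block_mean(image, factor):
    """Average non-overlapping factor x factor blocks of a 2-D grayscale image.

    `image` is a list of equal-length rows. Block averaging is a crude stand-in
    for optical blur; it does not model a simulated point spread function.
    """
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image dimensions must be divisible by factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [image[r + i][c + j] for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


drone_gsd_m = 0.05  # assumed source GSD in metres per pixel (hypothetical)
factor = 2
image = [
    [0, 2, 4, 4],
    [2, 4, 4, 4],
    [8, 8, 1, 3],
    [8, 8, 5, 7],
]
coarse = downsample_block_mean(image, factor)
coarse_gsd_m = drone_gsd_m * factor  # each output pixel now covers twice the ground distance
```

Downsampling by an integer factor multiplies the GSD by that factor, so a detector can be evaluated at progressively coarser resolutions from a single high-resolution source.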
Coreference Resolution for the Biomedical Domain: A Survey.
EN: Issues with coreference resolution are one of the most frequently mentioned challenges for information extraction from the biomedical literature. Thus, the biomedical genre has long been the second most researched genre for coreference resolution after the news domain, and the subject of a great deal of research for NLP in general. In recent years this interest has grown enormously leading to the development of a number of substantial datasets, of domain-specific contextual language models, and of several architectures. In this paper we review the state-of-the-art of coreference in the biomedical domain with a particular attention on these most recent developments.
Vaccine allocation policy optimization and budget sharing mechanism using Thompson sampling.
EN: The optimal allocation of vaccines to population subgroups over time is a challenging health care management problem. In the context of a pandemic, the interaction between vaccination policies adopted by multiple agents and the cooperation (or lack thereof) creates a complex environment that affects the global transmission dynamics of the disease. In this study, we take the perspective of decision-making agents that aim to minimize the size of their susceptible populations and must allocate vaccine under limited supply. We assume that vaccine efficiency rates are unknown to agents and we propose an optimization policy based on Thompson sampling to learn mean vaccine efficiency rates over time. Furthermore, we develop a budget-balanced resource sharing mechanism to promote cooperation among agents. We apply the proposed framework to the COVID-19 pandemic. We use a raster model of the world where agents represent the main countries worldwide and interact in a global mobility network to generate multiple problem instances. Our numerical results show that the proposed vaccine allocation policy achieves a larger reduction in the number of susceptible individuals, infections and deaths g...
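The Thompson-sampling idea mentioned in this abstract can be illustrated with a minimal Beta-Bernoulli sketch. It is not the authors' allocation policy: the two-vaccine setup, the `true_eff` values and the name `thompson_allocate` are hypothetical, chosen only to show how sampling from a posterior lets an agent learn unknown efficiency rates while allocating doses.

```python
import random

def thompson_allocate(true_eff, n_doses, seed=0):
    """Beta-Bernoulli Thompson sampling over vaccines with unknown efficiency.

    true_eff: hypothetical per-vaccine success probabilities (hidden from the agent).
    Returns (doses given per vaccine, posterior mean efficiency per vaccine).
    """
    rng = random.Random(seed)
    k = len(true_eff)
    successes = [1] * k  # alpha of a Beta(1, 1) uniform prior
    failures = [1] * k   # beta of the same prior
    given = [0] * k
    for _ in range(n_doses):
        # Draw one plausible efficiency per vaccine from its current posterior.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(k)]
        choice = max(range(k), key=lambda i: samples[i])
        given[choice] += 1
        # Simulate the outcome of that dose and update the posterior.
        if rng.random() < true_eff[choice]:
            successes[choice] += 1
        else:
            failures[choice] += 1
    post_mean = [successes[i] / (successes[i] + failures[i]) for i in range(k)]
    return given, post_mean

# Illustrative run: the second (hypothetical) vaccine is more efficient.
given, post_mean = thompson_allocate([0.6, 0.9], n_doses=2000)
```

Over time most doses go to the vaccine whose sampled efficiency is usually highest, while the weaker option is still tried occasionally, which is the exploration behaviour that makes the estimate trustworthy.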
Programming and Training Rate-Independent Chemical Reaction Networks.
EN: Embedding computation in biochemical environments incompatible with traditional electronics is expected to have wide-ranging impact in synthetic biology, medicine, nanofabrication and other fields. Natural biochemical systems are typically modeled by chemical reaction networks (CRNs), and CRNs can be used as a specification language for synthetic chemical computation. In this paper, we identify a class of CRNs called non-competitive (NC) whose equilibria are absolutely robust to reaction rates and kinetic rate law, because their behavior is captured solely by their stoichiometric structure. Unlike prior work on rate-independent CRNs, checking non-competition and using it as a design criterion is easy and promises robust output. We also present a technique to program NC-CRNs using well-founded deep learning methods, showing a translation procedure from rectified linear unit (ReLU) neural networks to NC-CRNs. In the case of binary weight ReLU networks, our translation procedure is surprisingly tight in the sense that a single bimolecular reaction corresponds to a single ReLU node and vice versa. This compactness argues that neural networks may be a fitting paradigm for programming ra...
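To make the link between a bimolecular reaction and a ReLU concrete, here is a small sketch under stated assumptions. It uses the standard observation that the reaction X + Y -> W, run until it can no longer fire, leaves max(x - y, 0) copies of X regardless of reaction rates. This is an illustration of rate-independent computation in general, not necessarily the translation procedure of the paper; the species names and the helper `run_to_completion` are hypothetical.

```python
import random

def run_to_completion(counts, reactions, seed=0):
    """Fire applicable reactions in random order until none can fire.

    counts: dict mapping species name to molecule count.
    reactions: list of (reactants, products) pairs, each a dict of
    species name to stoichiometric coefficient.
    Only suitable for networks that are guaranteed to terminate.
    """
    rng = random.Random(seed)
    counts = dict(counts)
    while True:
        applicable = [
            r for r in reactions
            if all(counts.get(s, 0) >= n for s, n in r[0].items())
        ]
        if not applicable:
            return counts
        reactants, products = rng.choice(applicable)
        for s, n in reactants.items():
            counts[s] -= n
        for s, n in products.items():
            counts[s] = counts.get(s, 0) + n

# X + Y -> W: the leftover X count equals ReLU(x - y) = max(x - y, 0).
relu_crn = [({"X": 1, "Y": 1}, {"W": 1})]
positive = run_to_completion({"X": 7, "Y": 3}, relu_crn)
clipped = run_to_completion({"X": 2, "Y": 5}, relu_crn)
```

Because the final counts depend only on stoichiometry, changing the seed (the firing order) does not change the leftover amount of X.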
Multilingual Molecular Representation Learning via Contrastive Pre-training.
EN: Molecular representation learning plays an essential role in cheminformatics. Recently, language model-based approaches have gained popularity as an alternative to traditional expert-designed features to encode molecules. However, these approaches only utilize a single molecular language for representation learning. Motivated by the fact that a given molecule can be described using different languages such as Simplified Molecular Line Entry System (SMILES), The International Union of Pure and Applied Chemistry (IUPAC), and The IUPAC International Chemical Identifier (InChI), we propose a multilingual molecular embedding generation approach called MM-Deacon (multilingual molecular domain embedding analysis via contrastive learning). MM-Deacon is pre-trained using SMILES and IUPAC as two different languages on large-scale molecules. We evaluated the robustness of our method on seven molecular property prediction tasks from MoleculeNet benchmark, zero-shot cross-lingual retrieval, and a drug-drug interaction prediction task.
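The contrastive pre-training idea can be sketched with a symmetric InfoNCE-style loss. The abstract does not state the exact objective, so treat this as an assumption: row i of each batch is taken to be the same molecule embedded from its SMILES string and from its IUPAC name, and the function name `info_nce` and the temperature value are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(smiles_emb, iupac_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    a = [l2_normalize(v) for v in smiles_emb]
    b = [l2_normalize(v) for v in iupac_emb]
    n = len(a)
    # Cosine similarity of every SMILES row with every IUPAC row.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean negative log-softmax of the matching-pair (diagonal) entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))

# Matching pairs give a low loss; mismatched pairs give a high one.
aligned = info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = info_nce([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

Minimising such a loss pulls the two descriptions of one molecule together in embedding space and pushes different molecules apart, which is what makes cross-lingual retrieval possible without labels.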
Slot Filling for Biomedical Information Extraction.
EN: Information Extraction (IE) from text refers to the task of extracting structured knowledge from unstructured text. The task typically consists of a series of sub-tasks such as Named Entity Recognition and Relation Extraction. Sourcing entity and relation type specific training data is a major bottleneck in domains with limited resources such as biomedicine. In this work we present a slot filling approach to the task of biomedical IE, effectively replacing the need for entity and relation-specific training data, allowing us to deal with zero-shot settings. We follow the recently proposed paradigm of coupling a Transformer-based bi-encoder, Dense Passage Retrieval, with a Transformer-based reading comprehension model to extract relations from biomedical text. We assemble a biomedical slot filling dataset for both retrieval and reading comprehension and conduct a series of experiments demonstrating that our approach outperforms a number of simpler baselines. We also evaluate our approach end-to-end for standard as well as zero-shot settings. Our work provides a fresh perspective on how to solve biomedical IE tasks, in the absence of relevant training data. Our code, models and dataset...
Modelling control strategies against Classical Swine Fever: influence of traders and markets using static and temporal networks in Ecuador.
EN: Classical swine fever (CSF) has been prevalent in Ecuador since 1940, and pig farming represents an important economic and cultural sector. Recently, the National Veterinary Service (NVS) has implemented individual identification of pigs, movement control and mandatory vaccination against CSF, aiming for future eradication. Our aim was to characterise the pig premises according to risk criteria, analyse the effect of random and targeted strategies to control CSF and consider the temporal development of the network. We used social network analysis (SNA), SIRS (susceptible, infected, recovered, susceptible) network modelling and temporal network analysis. The data set contained 751,003 shipments and 6 million pigs from 2017 to 2019. 165,593 premises were involved: 144,118 farms, 138 industrials, 21,337 traders, and 51 markets. On annual average, 124,976 premises (75%) received or sent one movement with 1.5 pigs; in contrast, 166 (0.01%) had 1,372 movements and 11,607 pigs. Simulations resulted in a CSF mean prevalence of 29.93%; a targeted selection strategy reduced the prevalence to 3.3%, compared with 24% under random selection. Selection of high-risk premises in every province was the best strategy ...
Deep Generative Models for Drug Design and Response.
EN: Designing new chemical compounds with desired pharmaceutical properties is a challenging task and takes years of development and testing. Still, a majority of new drugs fail to prove efficient. The recent success of deep generative modeling holds promise for the generation and optimization of new molecules. In this review paper, we provide an overview of the current generative models, and describe necessary biological and chemical terminology, including molecular representations needed to understand the field of drug design and drug response. We present commonly used chemical and biological databases, and tools for generative modeling. Finally, we summarize the current state of generative modeling for drug design and drug response prediction, highlighting the state-of-the-art approaches and limitations the field is currently facing.
Inverse design of 3d molecular structures with conditional generative neural networks.
EN: The rational design of molecules with desired properties is a long-standing challenge in chemistry. Generative neural networks have emerged as a powerful approach to sample novel molecules from a learned distribution. Here, we propose a conditional generative neural network for 3d molecular structures with specified chemical and structural properties. This approach is agnostic to chemical bonding and enables targeted sampling of novel molecules from conditional distributions, even in domains where reference calculations are sparse. We demonstrate the utility of our method for inverse design by generating molecules with specified motifs or composition, discovering particularly stable molecules, and jointly targeting multiple electronic properties beyond the training regime.
Emerging vaccine-breakthrough SARS-CoV-2 variants.
EN: The recent global surge in COVID-19 infections has been fueled by new SARS-CoV-2 variants, namely Alpha, Beta, Gamma, Delta, etc. The molecular mechanism underlying such a surge is elusive due to 4,653 non-degenerate mutations on the spike protein, which is the target of most COVID-19 vaccines. The understanding of the molecular mechanism of transmission and evolution is a prerequisite to foresee the trend of emerging vaccine-breakthrough variants and the design of mutation-proof vaccines and monoclonal antibodies. We integrate the genotyping of 1,489,884 SARS-CoV-2 genome isolates, 130 human antibodies, tens of thousands of mutational data points, topological data analysis, and deep learning to reveal the SARS-CoV-2 evolution mechanism and forecast emerging vaccine-escape variants. We show that infectivity-strengthening and antibody-disruptive co-mutations on the S protein RBD can quantitatively explain the infectivity and virulence of all prevailing variants. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough co-mutations in 20 countries, including the United Kingdom, the United States, Denmark, Brazil, and Germ...
IMG2SMI: Translating Molecular Structure Images to Simplified Molecular-input Line-entry System.
EN: Like many scientific fields, new chemistry literature has grown at a staggering pace, with thousands of papers released every month. A large portion of chemistry literature focuses on new molecules and reactions between molecules. Most vital information is conveyed through 2-D images of molecules, representing the underlying molecules or reactions described. In order to ensure reproducible and machine-readable molecule representations, text-based molecule descriptors like SMILES and SELFIES were created. These text-based molecule representations provide molecule generation but are unfortunately rarely present in published literature. In the absence of molecule descriptors, the generation of molecule descriptors from the 2-D images present in the literature is necessary to understand chemistry literature at scale. Successful methods such as Optical Structure Recognition Application (OSRA) and ChemSchematicResolver are able to extract the locations of molecular structures in chemistry papers and infer molecular descriptions and reactions. While effective, existing systems expect chemists to correct outputs, making them unsuitable for unsupervised large-scale data mining. Leveraging ...
Biomedical Data-to-Text Generation via Fine-Tuning Transformers.
EN: Data-to-text (D2T) generation in the biomedical domain is a promising - yet mostly unexplored - field of research. Here, we apply neural models for D2T generation to a real-world dataset consisting of package leaflets of European medicines. We show that fine-tuned transformers are able to generate realistic, multisentence text from data in the biomedical domain, yet have important limitations. We also release a new dataset (BioLeaflets) for benchmarking D2T generation models in the biomedical domain.
Ligand-induced protein dynamics differences correlate with protein-ligand binding affinities: An unsupervised deep learning approach.
EN: Prediction of protein-ligand binding affinity is a major goal in drug discovery. Generally, free energy gap is calculated between two states (e.g., ligand binding and unbinding). The energy gap implicitly includes the effects of changes in protein dynamics induced by the binding ligand. However, the relationship between protein dynamics and binding affinity remains unclear. Here, we propose a novel method that represents protein behavioral change upon ligand binding with a simple feature that can be used to predict protein-ligand affinity. From unbiased molecular simulation data, an unsupervised deep learning method measures the differences in protein dynamics at a ligand-binding site depending on the bound ligands. A dimension-reduction method extracts a dynamic feature that is strongly correlated to the binding affinities. Moreover, the residues that play important roles in protein-ligand interactions are specified based on their contribution to the differences. These results indicate the potential for dynamics-based drug discovery.
Resting state fMRI-based temporal coherence mapping.
EN: Long-range temporal coherence (LRTC) is quite common to dynamic systems and is fundamental to the system function. LRTC in the brain has been shown to be important to cognition. Assessing LRTC may provide critical information for understanding the potential underpinnings of brain organization, function, and cognition. To facilitate this overarching goal, we provide a method, which is named temporal coherence mapping (TCM), to explicitly quantify LRTC using resting state fMRI. TCM is based on correlation analysis of the transit states of the phase space reconstructed by temporal embedding. A few TCM properties were collected to measure LRTC, including the averaged correlation, anti-correlation, the ratio of correlation and anticorrelation, the mean coherent and incoherent duration, and the ratio between the coherent and incoherent time. TCM was first evaluated with simulations and then with the large Human Connectome Project data. Evaluation results showed that TCM metrics can successfully differentiate signals with different temporal coherence regardless of the parameters used to reconstruct the phase space. In human brain, TCM metrics except the ratio of the coherent/incoherent ti...
Modeling the effect of the vaccination campaign on the Covid-19 pandemic.
EN: Population-wide vaccination is critical for containing the SARS-CoV-2 (Covid-19) pandemic when combined with restrictive and prevention measures. In this study, we introduce SAIVR, a mathematical model able to forecast the Covid-19 epidemic evolution during the vaccination campaign. SAIVR extends the widely used Susceptible-Infectious-Removed (SIR) model by considering the Asymptomatic (A) and Vaccinated (V) compartments. The model contains several parameters and initial conditions that are estimated by employing a semi-supervised machine learning procedure. After training an unsupervised neural network to solve the SAIVR differential equations, a supervised framework then estimates the optimal conditions and parameters that best fit recent infectious curves of 27 countries. Instructed by these results, we performed an extensive study on the temporal evolution of the pandemic under varying values of roll-out daily rates, vaccine efficacy, and a broad range of societal vaccine hesitancy/denial levels. The concept of herd immunity is questioned by studying future scenarios which involve different vaccination efforts and more infectious Covid-19 variants.
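As a rough illustration of how a vaccination flow changes compartmental dynamics, the sketch below integrates a plain SIR model with one extra Vaccinated compartment. It is deliberately simpler than SAIVR: it has no Asymptomatic compartment, it assumes perfect vaccine efficacy, and the parameter values and the name `simulate_sirv` are hypothetical rather than taken from the paper.

```python
def simulate_sirv(beta, gamma, nu, s0=0.99, i0=0.01, days=200, dt=0.1):
    """Forward-Euler integration of an SIR model with a vaccination flow.

    State variables are population fractions. nu is a hypothetical daily
    vaccination rate applied to susceptibles.
    Returns a list of (s, i, r, v) tuples, one per time step.
    """
    s, i, r, v = s0, i0, 0.0, 0.0
    history = [(s, i, r, v)]
    for _ in range(int(round(days / dt))):
        new_inf = beta * s * i * dt  # S -> I
        new_rec = gamma * i * dt     # I -> R
        new_vac = nu * s * dt        # S -> V
        s -= new_inf + new_vac
        i += new_inf - new_rec
        r += new_rec
        v += new_vac
        history.append((s, i, r, v))
    return history

# Compare the infection peak with and without a vaccination roll-out.
no_vax = simulate_sirv(beta=0.3, gamma=0.1, nu=0.0)
with_vax = simulate_sirv(beta=0.3, gamma=0.1, nu=0.01)
peak_no_vax = max(state[1] for state in no_vax)
peak_with_vax = max(state[1] for state in with_vax)
```

Varying `nu` in such a toy model mimics the roll-out-rate scenarios described in the abstract, although fitting parameters to real infection curves, as the authors do, requires far more machinery.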
DeepGene Transformer: Transformer for the gene expression-based classification of cancer subtypes.
EN: Cancer and its subtypes constitute approximately 30% of all causes of death globally and display a wide range of heterogeneity in terms of clinical and molecular responses to therapy. Molecular subtyping has enabled the use of precision medicine to overcome these challenges and provide significant biological insights to predict prognosis and improve clinical decision-making. Over the past decade, conventional machine learning (ML) and deep learning (DL) algorithms have been widely espoused for the classification of cancer subtypes from gene expression datasets. However, these methods are potentially biased toward the identification of cancer biomarkers. Hence, an end-to-end deep learning approach, DeepGene Transformer, is proposed which addresses the complexity of high-dimensional gene expression with a multi-head self-attention module by identifying relevant biomarkers across multiple cancer subtypes without requiring feature selection as a pre-requisite for the current classification algorithms. Comparative analysis reveals that the proposed DeepGene Transformer outperformed the commonly used traditional and state-of-the-art classification algorithms and can be considered an effi...
Hybrid quantum-classical machine learning for generative chemistry and drug design.
EN: Deep generative chemistry models emerge as powerful tools to expedite drug discovery. However, the immense size and complexity of the structural space of all possible drug-like molecules pose significant obstacles, which could be overcome with hybrid architectures combining quantum computers with deep classical networks. As the first step toward this goal, we built a compact discrete variational autoencoder (DVAE) with a Restricted Boltzmann Machine (RBM) of reduced size in its latent layer. The size of the proposed model was small enough to fit on a state-of-the-art D-Wave quantum annealer and allowed training on a subset of the ChEMBL dataset of biologically active compounds. Finally, we generated 2331 novel chemical structures with medicinal chemistry and synthetic accessibility properties in the ranges typical for molecules from ChEMBL. The presented results demonstrate the feasibility of using already existing or soon-to-be-available quantum computing devices as testbeds for future drug discovery applications.
APObind: A Dataset of Ligand Unbound Protein Conformations for Machine Learning Applications in De Novo Drug Design.
EN: Protein-ligand complex structures have been utilised to design benchmark machine learning methods that perform important tasks related to drug design such as receptor binding site detection, small molecule docking and binding affinity prediction. However, these methods are usually trained on only ligand bound (or holo) conformations of the protein and therefore are not guaranteed to perform well when the protein structure is in its native unbound conformation (or apo), which is usually the conformation available for a newly identified receptor. A primary reason for this is that the local structure of the binding site usually changes upon ligand binding. To facilitate solutions for this problem, we propose a dataset called APObind that aims to provide apo conformations of proteins present in the PDBbind dataset, a popular dataset used in drug design. Furthermore, we explore the performance of methods specific to three use cases on this dataset, through which, the importance of validating them on the APObind dataset is demonstrated.
Neuromodulators in food ingredients: insights from network pharmacological evaluation of Ayurvedic herbs.
EN: The global burden of neurological diseases, the second leading cause of death after heart diseases, constitutes one of the major challenges of modern medicine. Ayurveda, the traditional Indian medicinal system enrooted in the Vedic literature and considered as a schema for the holistic management of health, characterizes various neurological diseases and disorders (NDDs) and prescribes several herbs, formulations, and bio-cleansing regimes for their care and cure. In this work, we examined the neuro-phytoregulatory potential of 34,472 phytochemicals among 3,038 herbs (including their varieties) mentioned in Ayurveda using a network pharmacology approach and found that 45% of these Ayurvedic phytochemicals (APCs) have regulatory associations with 1,643 approved protein targets. Metabolite interconversion enzymes and protein modifying enzymes were found to be the major target classes of APCs against NDDs. The study further suggests that the actions of Ayurvedic herbs in managing NDDs were mainly via regulating signalling processes such as G-protein signaling, acetylcholine signaling, chemokine signaling pathway and GnRH signaling. A high confidence network specific to 219 pharmaceutically relev...
Designing drug regimens that mitigate nonadherence.
EN: Medication adherence is a well-known problem for pharmaceutical treatment of chronic diseases. Understanding how nonadherence affects treatment efficacy is made difficult by the ethics of clinical trials that force patients to skip doses of the medication being tested, the unpredictable timing of missed doses by actual patients, and the many competing variables that can either mitigate or magnify the deleterious effects of nonadherence, such as pharmacokinetic absorption and elimination rates, dosing intervals, dose sizes, adherence rates, etc. In this paper, we formulate and analyze a mathematical model of the drug concentration in an imperfectly adherent patient. Our model takes the form of the standard single compartment pharmacokinetic model with first order absorption and elimination, except that the patient takes medication only at a given proportion of the prescribed dosing times. Doses are missed randomly, and we use stochastic analysis to study the resulting random drug level in the body. We then use our mathematical results to propose principles for designing drug regimens that are robust to nonadherence. In particular, we quantify the resilience of extended release drugs...
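The model described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes each scheduled dose is taken independently with probability `p_adherence`, uses the standard single-dose solution of the one-compartment model with first-order absorption (`ka`) and elimination (`ke`), and superposes the doses that were actually taken. All parameter values (dose, interval, rates, unit volume) are hypothetical.

```python
import math
import random

def bateman(t, dose, ka, ke, volume=1.0):
    """Concentration at time t after one dose taken at time 0
    (one compartment, first-order absorption ka and elimination ke)."""
    if t < 0:
        return 0.0
    scale = dose * ka / (volume * (ka - ke))
    return scale * (math.exp(-ke * t) - math.exp(-ka * t))

def simulate_trough(p_adherence, n_doses, interval, dose, ka, ke, rng):
    """Concentration just before the next scheduled dose, where each of the
    n_doses earlier doses was taken independently with probability p_adherence
    (an assumption made for this sketch)."""
    t_obs = n_doses * interval
    total = 0.0
    for k in range(n_doses):
        if rng.random() < p_adherence:
            total += bateman(t_obs - k * interval, dose, ka, ke)
    return total

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical regimen: 100 units every 24 h, ka = 1.0 / h, ke = 0.1 / h
perfect = simulate_trough(1.0, 200, 24.0, 100.0, 1.0, 0.1, rng)
samples = [simulate_trough(0.8, 200, 24.0, 100.0, 1.0, 0.1, rng)
           for _ in range(2000)]
mean_trough = sum(samples) / len(samples)
```

Because the model is linear in the doses, the mean trough level under 80% adherence should sit near 0.8 times the fully adherent trough. The spread of `samples` around that mean is what a regimen designer would want to reduce.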
OncoPetNet: A Deep Learning based AI system for mitotic figure counting on H&E stained whole slide digital images in a large veterinary diagnostic lab setting.
EN: Background: Histopathology is an important modality for the diagnosis and management of many diseases in modern healthcare, and plays a critical role in cancer care. Pathology samples can be large and require multi-site sampling, leading to upwards of 20 slides for a single tumor, and the human-expert tasks of site selection and quantitative assessment of mitotic figures are time consuming and subjective. Automating these tasks in the setting of a digital pathology service presents significant opportunities to improve workflow efficiency and augment human experts in practice. Approach: Multiple state-of-the-art deep learning techniques for histopathology image classification and mitotic figure detection were used in the development of OncoPetNet. Additionally, model-free approaches were used to increase speed and accuracy. The robust and scalable inference engine leverages PyTorch's performance optimizations as well as specifically developed speed-up techniques in inference. Results: The proposed system demonstrated significantly improved mitotic counting performance for 41 cancer cases across 14 cancer types compared to human expert baselines. In 21.9% of cases use of OncoPet...
The impact of contact inhibition on collective cell migration and proliferation.
EN: Contact inhibition limits migration and proliferation of cells in cell colonies. We consider a multiphase field model to investigate the growth dynamics of a cell colony composed of proliferating cells. The model takes into account the mechanisms of contact inhibition of locomotion and proliferation by local mechanical interactions. We compare non-migrating and migrating cells in order to provide a quantitative characterization of the dynamics, and analyse the velocity of the colony boundary for both cases. Additionally, we measure single cell velocities, number of neighbour distributions, as well as the influence of stress and age on positions of the cells and with respect to each other. We further compare the findings with experimental data for Madin-Darby canine kidney cells.
Leaf Recognition Using Convolutional Neural Networks Based Features.
EN: There is a warning light for the loss of plant habitats worldwide that entails concerted efforts to conserve plant biodiversity. Thus, plant species classification is of crucial importance to address this environmental challenge. In recent years, there has been a considerable increase in the number of studies related to plant taxonomy. While some researchers try to improve their recognition performance using novel approaches, others concentrate on computational optimization for their framework. In addition, a few studies are diving into feature extraction to gain significantly in terms of accuracy. In this paper, we propose an effective method for the leaf recognition problem. In our proposed approach, a leaf goes through some pre-processing to extract its refined color image, vein image, xy-projection histogram, handcrafted shape, texture features, and Fourier descriptors. These attributes are then transformed into a better representation by neural network-based encoders before a support vector machine (SVM) model is utilized to classify different leaves. Overall, our approach achieves a state-of-the-art result on the Flavia leaf dataset, with an accuracy of 99.58% on test sets u...
CPSC: Conformal prediction with shrunken centroids for efficient prediction reliability quantification and data augmentation, a case in alternative herbal medicine classification with electronic nose.
EN: In machine learning applications, the reliability of predictions is significant for assisted decision and risk control. As an effective framework to quantify the prediction reliability, conformal prediction (CP) was developed with the CPKNN (CP with kNN). However, the conventional CPKNN suffers from high variance and bias and long computational time as the feature dimensionality increases. To address these limitations, a new CP framework, conformal prediction with shrunken centroids (CPSC), is proposed. It regularizes the class centroids to attenuate the irrelevant features and shrink the sample space for predictions and reliability quantification. To compare CPKNN and CPSC, we employed them in the classification of 12 categories of alternative herbal medicine with electronic nose as a case and assessed them in two tasks: 1) offline prediction: the training set was fixed and the accuracy on the testing set was evaluated; 2) online prediction with data augmentation: they filtered unlabeled data to augment the training data based on the prediction reliability and the final accuracy on the testing set was compared. The result shows that CPSC significantly outperformed CPKNN in both two task...
Multiple species animal movements: network properties, disease dynamic and the impact of targeted control actions.
EN: Infectious diseases in livestock are well-known to infect multiple hosts and persist through the combination of within- and between-host transmission pathways. Uncertainty remains about the epidemic consequences of the disease being introduced on farms with more than one susceptible host. Here we describe multi-host contact networks to elucidate the potential of disease spread among farms with multiple species. Four years of between-farm animal movement data of bovine, swine, small ruminants, and multi-host, were described through both static and time-series networks; the in-going and out-going contact chains were also calculated. We use the proposed stochastic multilevel model to simulate scenarios in which infection was seeded into a single host and multi-hosts farms, to estimate epidemic trajectories and simulate network-based control actions to assess the reduction of secondarily infected farms. Our analysis showed that the swine network was more connected than cattle and small ruminants in the temporal network view. The small ruminants network was shown disconnected, however, allowing the interaction among networks with different hosts enabling the spread of disease throughout...
Conformer-specific Chemistry Imaged in Real Space and Time.
EN: Conformational isomers or conformers of molecules play a decisive role in chemistry and biology. However, experimental methods to investigate chemical reaction dynamics are typically not conformer-sensitive. Here, we report on a gas-phase megaelectronvolt ultrafast electron diffraction investigation of α-phellandrene undergoing an electrocyclic ring-opening reaction. We directly image the evolution of a specific set of α-phellandrene conformers into the product isomer predicted by the Woodward-Hoffmann rules in real space and time. Our experimental results are in quantitative agreement with nonadiabatic quantum molecular dynamics simulations, which provide unprecedented detail of how conformation influences time scale and quantum efficiency of photoinduced ring-opening reactions. Due to the prevalence of large numbers of conformers in organic chemistry, our findings impact our general understanding of reaction dynamics in chemistry and biology.
Embedding digital chronotherapy into medical devices -- A canine validation for controlling status epilepticus through multi-scale rhythmic brain stimulation.
EN: Circadian and other physiological rhythms play a key role in both normal homeostasis and disease processes. Such is the case of circadian and infradian seizure patterns observed in epilepsy. In this paper we explore a new implantable stimulator that implements chronotherapy as a feedforward input to supplement both open-loop and closed-loop methods. This integrated algorithm allows for stimulation to be adjusted to the ultradian, circadian and infradian patterns observed in patients through slowly-varying temporal adjustments of stimulation and algorithm sub-components, while also enabling adaption of stimulation based on immediate physiological needs such as a breakthrough seizure or change of posture. Embedded physiological sensors in the stimulator can be used to refine the baseline stimulation circadian pattern as a "digital zeitgeber". This approach is tested on a canine with severe drug-resistant idiopathic generalized epilepsy exhibiting a diurnal pattern correlated with sleep-wake cycles. Prior to implantation, the canine's cluster seizures evolved to status epilepticus (SE) and required emergency pharmacological intervention. The cranially-mounted system was fully-implante...
Parasitic Egg Detection and Classification in Low-cost Microscopic Images using Transfer Learning.
EN: Intestinal parasitic infection leads to several morbidities to humans worldwide, especially in tropical countries. The traditional diagnosis usually relies on manual analysis from microscopic images which is prone to human error due to morphological similarity of different parasitic eggs and abundance of impurities in a sample. Many studies have developed automatic systems for parasite egg detection to reduce human workload. However, they work with high quality microscopes, which unfortunately remain unaffordable in some rural areas. Our work thus exploits a benefit of a low-cost USB microscope. This instrument however provides poor quality of images due to limitation of magnification (10x), causing difficulty in parasite detection and species classification. In this paper, we propose a CNN-based technique using transfer learning strategy to enhance the efficiency of automatic parasite classification in poor-quality microscopic images. The patch-based technique with sliding window is employed to search for location of the eggs. Two networks, AlexNet and ResNet50, are examined with a trade-off between architecture size and classification performance. The results show that our propos...
Ultrasonic chaining of emulsion droplets.
EN: Emulsion droplets trapped in an ultrasonic levitator behave in two ways that solid spheres do not: (1) Individual droplets spin rapidly about an axis parallel to the trapping plane, and (2) coaxially spinning droplets form long chains aligned with their common axis of rotation. Acoustically-organized chains interact hydrodynamically, either to merge into longer chains or to form three-dimensional bundles of chains. Solid spheres, by contrast, form close-packed planar crystals drawn together by the sound-mediated secondary Bjerknes interaction. We demonstrate the chain-to-crystal transition with a model system in which fluid emulsion droplets can be photopolymerized into solid spheres without significantly changing other material properties. The behavior of this experimental system is quantitatively consistent with an acoustohydrodynamic model for spinning spheres in an acoustic levitator. This study therefore introduces acoustically-driven spinning as a mechanism for guiding self-organization of acoustically levitated matter.
Bimolecular chemistry in the ultracold regime.
EN: Advances in atomic, molecular, and optical (AMO) physics techniques allowed the cooling of simple molecules down to the ultracold regime (≲ 1 mK), and opened the opportunities to study chemical reactions with unprecedented levels of control. This review covers recent developments in studying bimolecular chemistry at ultralow temperatures. We begin with a brief overview of methods for producing, manipulating, and detecting ultracold molecules. We then survey experimental works that exploit the controllability of ultracold molecules to probe and modify their long-range interactions. Further combining the use of physical chemistry techniques, such as mass spectrometry and ion imaging, significantly improved the detection of ultracold reactions and enabled explorations of their dynamics in the short-range. We discuss a series of studies on the reaction KRb + KRb → K₂ + Rb₂ initiated below 1 μK, including the direct observation of a long-lived complex, the demonstration of product rotational state control via conserved nuclear spins, and a test of the statistical model using the complete quantum state distribution of the products.
Molecular Chemistry for Dark Matter.
EN: Molecular cooling is essential for studying the formation of sub-structure of dissipative dark-matter halos that may host compact objects such as black holes. Here, we analyze the reaction rates relevant for the formation, dissociation, and transition of hydrogenic molecules while allowing for different values of the physical parameters: the coupling constant, the proton mass, and the electron mass. For all cases, we re-scale the reaction rates for the standard molecular hydrogen, so our results are valid as long as the dark matter is weakly coupled and one of the fermions is much heavier than the other. These results will allow a robust numerical treatment of cosmic structure, in particular for mini-halos for which molecular cooling is important, in a dissipative dark matter scenario.
Stringiness of Hyaluronic Acid Emulsions.
EN: In this work, we underline the importance of the molecular weight of hyaluronic acid on the elongational properties of concentrated emulsions. The filament formation properties, e.g. the stringiness, of an emulsion is a key determinant of a product liking and repeat purchase. Here, we find that high molecular weight hyaluronic acid and a high stretching speed are the control parameters affecting the filament formation of an emulsion.
Creep and drainage in the fast destabilization of emulsions.
EN: The destabilization of emulsions is important for many applications but remains incompletely understood. We perform squeeze flow measurements on oil-in-water emulsions, finding that the spontaneous destabilization of emulsions is generally very slow under normal conditions, with a characteristic time scale given by the drainage of the continuous phase and the coalescence of the dispersed phase. We show that if the emulsion is compressed between two plates, the destabilization can be sped up significantly; on the one hand, the drainage is faster due to the application of the squeezing force. On the other hand, creep processes lead to rearrangements that also contribute to the destabilization.
Emulsion Destabilization by Squeeze Flow.
EN: There is a large debate on the destabilization mechanism of emulsions. We present a simple technique using mechanical compression to destabilize oil-in-water emulsions. Upon compression of the emulsion, the continuous aqueous phase is squeezed out, while the dispersed oil phase progressively deforms from circular to honeycomb-like shapes. The films that separate the oil droplets are observed to thin and break at a critical oil/water ratio, leading to coalescence events. Electrostatic interactions and local droplet rearrangements do not determine film rupture. Instead, the destabilization occurs like an avalanche propagating through the system, starting at areas where the film thickness is smallest.
Biomedical Interpretable Entity Representations.
EN: Pre-trained language models induce dense entity representations that offer strong performance on entity-centric NLP tasks, but such representations are not immediately interpretable. This can be a barrier to model uptake in important domains such as biomedicine. There has been recent work on general interpretable representation learning (Onoe and Durrett, 2020), but these domain-agnostic representations do not readily transfer to the important domain of biomedicine. In this paper, we create a new entity type system and training set from a large corpus of biomedical texts by mapping entities to concepts in a medical ontology, and from these to Wikipedia pages whose categories are our types. From this mapping we derive Biomedical Interpretable Entity Representations (BIERs), in which dimensions correspond to fine-grained entity types, and values are predicted probabilities that a given entity is of the corresponding type. We propose a novel method that exploits BIER's final sparse and intermediate dense representations to facilitate model and entity type debugging. We show that BIERs achieve strong performance in biomedical tasks including named entity disambiguation and entity label ...
COVID-19 Vaccine Misinformation Campaigns and Social Media Narratives.
EN: COVID-19 vaccine hesitancy has increased concerns about vaccine uptake required to overcome the pandemic and protect public health. A critical factor associated with anti-vaccine attitudes is the information shared on social media. In this work, we investigate misinformation communities and narratives that can contribute to COVID-19 vaccine hesitancy. During the pandemic, anti-science and political misinformation/conspiracies have been rampant on social media. Therefore, we investigate misinformation and conspiracy groups and their characteristic behaviours in Twitter data collected on COVID-19 vaccines. We identify if any suspicious coordinated efforts are present in promoting vaccine misinformation, and find two suspicious groups - one promoting a 'Great Reset' conspiracy which suggests that the pandemic is orchestrated by world leaders to take control of the economy, with vaccine related misinformation and strong anti-vaccine and anti-social messages such as no lock-downs; and another promoting the Bioweapon theory. Misinformation promoted is largely from the anti-vaccine and far-right communities in the 3-core of the retweet graph, with its tweets proportion of conspiracy and q...
CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark.
EN: Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, is gradually changing medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification, and an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform far worse than the human ceiling. Our benchmark is released at https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us.
Protein-Ligand Docking Surrogate Models: A SARS-CoV-2 Benchmark for Deep Learning Accelerated Virtual Screening.
EN: We propose a benchmark to study surrogate model accuracy for protein-ligand docking. We share a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. Our work shows surrogate docking models have six orders of magnitude more throughput than standard docking protocols on the same supercomputer node types. We demonstrate the power of high-speed surrogate models by running each target against 1 billion molecules in under a day (50k predictions per GPU second). We showcase a workflow for docking utilizing surrogate ML models as a pre-filter. Our workflow is ten times faster at screening a library of compounds than the standard technique, with an error rate less than 0.01% of detecting the underlying best scoring 0.1% of compounds. Our analysis of the speedup explains that to screen more molecules under a docking paradigm, another order of magnitude speedup must come from model accuracy rather than computing speed (which, if increased, would no longer alter our throughput to screen molecules). We believe this is strong evidence for ...
Locally Sparse Neural Networks for Tabular Biomedical Data.
EN: Tabular datasets with low-sample-size or many variables are prevalent in biomedicine. Practitioners in this domain prefer linear or tree-based models over neural networks since the latter are harder to interpret and tend to overfit when applied to tabular datasets. To address these neural networks' shortcomings, we propose an intrinsically interpretable network for heterogeneous biomedical data. We design a locally sparse neural network where the local sparsity is learned to identify the subset of most relevant features for each sample. This sample-specific sparsity is predicted via a gating network, which is trained in tandem with the prediction network. By forcing the model to select a subset of the most informative features for each sample, we reduce model overfitting in low-sample-size data and obtain an interpretable model. We demonstrate that our method outperforms state-of-the-art models when applied to synthetic or real-world biomedical datasets using extensive experiments. Furthermore, the proposed framework dramatically outperforms existing schemes when evaluating its interpretability capabilities. Finally, we demonstrate the applicability of our model t...
Virtual Screening of Pharmaceutical Compounds with hERG Inhibitory Activity (Cardiotoxicity) using Ensemble Learning.
EN: In silico prediction of cardiotoxicity with high sensitivity and specificity for potential drug molecules can be of immense value. Hence, building machine learning classification models, based on some features extracted from the molecular structure of drugs, which are capable of efficiently predicting cardiotoxicity is critical. In this paper, we consider the application of various machine learning approaches, and then propose an ensemble classifier for the prediction of molecular activity on a Drug Discovery Hackathon (DDH) (1st reference) dataset. We have used only 2-D descriptors derived from SMILES notation for our prediction. Our ensemble classification uses 5 classifiers (2 Random Forest Classifiers, 2 Support Vector Machines and a Dense Neural Network) and uses the Max-Voting technique and the Weighted-Average technique for the final decision.
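The two combination rules named above can be illustrated with a short sketch. It is not the authors' implementation: the five probabilities stand in for hypothetical outputs of the five base classifiers on one molecule, the weights are assumed equal, and the 0.5 decision threshold is an assumption.

```python
from collections import Counter

def max_vote(labels):
    """Return the class label predicted by the largest number of classifiers."""
    return Counter(labels).most_common(1)[0][0]

def weighted_average(probs, weights):
    """Weighted mean of the classifiers' positive-class probabilities."""
    return sum(p * w for p, w in zip(probs, weights)) / sum(weights)

# Hypothetical positive-class (hERG blocker) probabilities from five classifiers
probs = [0.95, 0.90, 0.45, 0.40, 0.45]
weights = [1.0, 1.0, 1.0, 1.0, 1.0]      # assumed equal weights

labels = [int(p >= 0.5) for p in probs]  # hard labels: [1, 1, 0, 0, 0]
vote = max_vote(labels)                  # three of five classifiers say 0
avg = weighted_average(probs, weights)   # mean probability is about 0.63
avg_label = int(avg >= 0.5)
```

In this example the two rules disagree: max voting returns class 0, while the averaged probability crosses the threshold and gives class 1, because two classifiers are very confident. This is one reason an ensemble might report both decisions.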
Computer-Assisted Analysis of Biomedical Images.
EN: Nowadays, the amount of heterogeneous biomedical data is increasing more and more thanks to novel sensing techniques and high-throughput technologies. In reference to biomedical image analysis, the advances in image acquisition modalities and high-throughput imaging experiments are creating new challenges. This huge information ensemble could overwhelm the analytic capabilities needed by physicians in their daily decision-making tasks as well as by biologists investigating complex biochemical systems. In particular, quantitative imaging methods convey scientifically and clinically relevant information in prediction, prognosis or treatment response assessment, by also considering radiomics approaches. Therefore, the computational analysis of medical and biological images plays a key role in radiology and laboratory applications. In this regard, frameworks based on advanced Machine Learning and Computational Intelligence can significantly improve traditional Image Processing and Pattern Recognition approaches. However, conventional Artificial Intelligence techniques must be tailored to address the unique challenges concerning biomedical imaging data. This thesis aims at proposing nov...
Informing Geometric Deep Learning with Electronic Interactions to Accelerate Quantum Chemistry.
EN: Predicting electronic energies, densities, and related chemical properties can facilitate the discovery of novel catalysts, medicines, and battery materials. By developing a physics-inspired equivariant neural network, we introduce a method to learn molecular representations based on the electronic interactions among atomic orbitals. Our method, OrbNet-Equi, leverages efficient tight-binding simulations and learned mappings to recover high fidelity quantum chemical properties. OrbNet-Equi models a wide spectrum of target properties with an accuracy consistently better than standard machine learning methods and a speed orders of magnitude greater than density functional theory. Despite only using training samples collected from readily available small-molecule libraries, OrbNet-Equi outperforms traditional methods on comprehensive downstream benchmarks that encompass diverse main-group chemical processes. Our method also describes interactions in challenging charge-transfer complexes and open-shell systems. We anticipate that the strategy presented here will help to expand opportunities for studies in chemistry and materials science, where the acquisition of experimental or referenc...
SciFive: a text-to-text transformer model for biomedical literature.
EN: In this report, we introduce SciFive, a domain-specific T5 model that has been pre-trained on large biomedical corpora. Our model outperforms the current SOTA methods (i.e. BERT, BioBERT, Base T5) on tasks in named entity relation, relation extraction, natural language inference, and question-answering. We show that text-generation methods have significant potential in a broad array of biomedical NLP tasks, particularly those requiring longer, more complex outputs. Our results support the exploration of more difficult text generation tasks and the development of new methods in this area.
Optimizing the location of vaccination sites to stop a zoonotic epidemic.
EN: The mainstay of canine rabies control is fixed point mass dog vaccination campaigns (MDVC). However, in some regions, ideal vaccination coverage in dogs is not obtained due to low participation in the MDVC. Travel distance to the vaccination sites has been identified as an important barrier to participation. We aim to increase MDVC participation by optimally placing fixed point vaccination locations to minimize walking distance to the nearest vaccination location. We quantified participation probability based on walking distance to the nearest vaccination point using a Poisson regression model. The regression was fit with survey data collected from 2016-2019. We then used a computational recursive interchange technique to solve the facility location problem to find a set of optimal placements of fixed point vaccination locations. Finally, we compared predicted participation of optimally placed vaccination sites to historical participation data from surveys collected from 2016-2019. We identified the p-median algorithm to solve the facility location problem as ideal for fixed point vaccination placement. We found a predicted increase in MDVC participation if vaccination locations ar...
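The facility location step described here can be illustrated with a generic swap (vertex-substitution) heuristic for the p-median problem. This is a sketch under stated assumptions, not the authors' implementation: houses and candidate sites are hypothetical 2-D points, straight-line distance stands in for walking distance, and the participation model is left out.

```python
import math

def total_distance(houses, sites):
    """Sum over houses of the distance to the nearest chosen site."""
    return sum(min(math.dist(h, s) for s in sites) for h in houses)

def interchange(houses, candidates, p):
    """Swap chosen sites for unchosen candidates while the total distance drops."""
    chosen = list(candidates[:p])  # arbitrary starting solution
    best = total_distance(houses, chosen)
    improved = True
    while improved:
        improved = False
        for i in range(p):
            for c in candidates:
                if c in chosen:
                    continue
                trial = chosen[:i] + [c] + chosen[i + 1:]
                cost = total_distance(houses, trial)
                if cost < best - 1e-12:
                    chosen, best, improved = trial, cost, True
    return chosen, best

# Hypothetical layout: two clusters of houses along a street.
houses = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
candidates = [(0, 0), (1, 0), (6, 0), (11, 0)]
sites, cost = interchange(houses, candidates, p=2)
```

The heuristic stops at a local optimum, so in practice it would be restarted from several initial solutions.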
Emotion Recognition in Horses with Convolutional Neural Networks.
EN: Creating intelligent systems capable of recognizing emotions is a difficult task, especially when looking at emotions in animals. This paper describes the process of designing a "proof of concept" system to recognize emotions in horses. This system is formed by two elements, a detector and a model. The detector is a fast region-based convolutional neural network that detects horses in an image. The model is a convolutional neural network that predicts the emotions of those horses. These two elements were trained with multiple images of horses until they achieved high accuracy in their tasks. In total, 400 images of horses were collected and labeled to train both the detector and the model while 40 were used to test the system. Once the two components were validated, they were combined into a testable system that would detect equine emotions based on established behavioral ethograms indicating emotional affect through head, neck, ear, muzzle and eye position. The system showed an accuracy of 80% on the validation set and 65% on the test set, demonstrating that it is possible to predict emotions in animals using autonomous intelligent systems. Such a system has multiple applications ...
Out-of-Distribution Detection in Dermatology using Input Perturbation and Subset Scanning.
EN: Recent advances in deep learning have led to breakthroughs in the development of automated skin disease classification. As we observe an increasing interest in these models in the dermatology space, it is crucial to address aspects such as the robustness towards input data distribution shifts. Current skin disease models could make incorrect inferences for test samples from different hardware devices and clinical settings or unknown disease samples, which are out-of-distribution (OOD) from the training samples. To this end, we propose a simple yet effective approach that detects these OOD samples prior to making any decision. The detection is performed via scanning in the latent space representation (e.g., activations of the inner layers of any pre-trained skin disease classifier). The input samples could also be perturbed to maximise divergence of OOD samples. We validate our OOD detection approach in two use cases: 1) identify samples collected from different protocols, and 2) detect samples from unknown disease classes. Additionally, we evaluate the performance of the proposed approach and compare it with other state-of-the-art methods. Furthermore, data-driven dermatology applicati...
Towards Realization of Augmented Intelligence in Dermatology: Advances and Future Directions.
EN: Artificial intelligence (AI) algorithms using deep learning have advanced the classification of skin disease images; however, these algorithms have been mostly applied "in silico" and not validated clinically. Most dermatology AI algorithms perform binary classification tasks (e.g. malignant versus benign lesions), but this task is not representative of dermatologists' diagnostic range. The American Academy of Dermatology Task Force on Augmented Intelligence published a position statement emphasizing the importance of clinical validation to create human-computer synergy, termed augmented intelligence (AuI). Liu et al.'s recent paper, "A deep learning system for differential diagnosis of skin diseases", represents a significant advancement of AI in dermatology, bringing it closer to clinical impact. However, significant issues must be addressed before this algorithm can be integrated into clinical workflow. These issues include accurate and equitable model development, defining and assessing appropriate clinical outcomes, and real-world integration.
Statistical Image Analysis of Drying Bovine Serum Albumin Droplets in Phosphate Buffered Saline.
EN: A bio-colloidal drying droplet can be used as a pre-diagnostic technique. However, a successful clinical setting requires a fundamental understanding of the final morphology and the way it is related to the initial state of the constituents present in the droplet. This chapter focuses on the physics associated with different pattern formations in the globular protein, bovine serum albumin (BSA) at different phosphate-buffered saline concentrations. The study reports that the first-order statistics (FOS) and the gray level co-occurrence matrix (GLCM) analysis are capable of capturing structural changes of the droplets. While the FOS of the image depends on the individual pixels, the GLCM summarizes both tonal and structural relationships between the neighboring pixels. The horizontal and the vertical orientations of the GLCM parameters show a non-significant effect when the pixel displacement is $\leq$ 1. Interestingly, two local equilibrium-like regions (the rim and the central regions) appear when these droplets approach the steady-state. The bimodal distribution confirms that the BSA-BSA interactions are dominant (recessive) over the BSA-saline interactions in the rim (central) r...
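The difference between first-order statistics and the GLCM can be made concrete with a small sketch. It assumes a hypothetical 4x4 image already quantized to four gray levels; the function names are illustrative and the code is not taken from the study.

```python
def glcm(image, dx=1, dy=0, levels=4):
    """Count co-occurrences of gray levels (i, j) at pixel offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix

def contrast(matrix):
    """GLCM contrast: squared gray-level difference weighted by pair frequency."""
    total = sum(sum(row) for row in matrix)
    weighted = sum((i - j) ** 2 * n
                   for i, row in enumerate(matrix)
                   for j, n in enumerate(row))
    return weighted / total

def mean_intensity(image):
    """A first-order statistic: depends on individual pixels only."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)

image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
horizontal = glcm(image, dx=1, dy=0)  # neighbor one pixel to the right
vertical = glcm(image, dx=0, dy=1)    # neighbor one pixel below
```

Shuffling the pixels leaves mean_intensity unchanged but alters the GLCM, which is why the latter captures structural as well as tonal information.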
Fish Disease Detection Using Image Based Machine Learning Technique in Aquaculture.
EN: Fish diseases in aquaculture pose a significant threat to food security. Identifying infected fish at an early stage remains challenging because the necessary infrastructure is lacking. Timely identification of infected fish is an essential step in preventing the spread of disease. In this work, we aim to detect disease in salmon, as salmon aquaculture is the fastest-growing food production system globally, accounting for 70 percent (2.5 million tons) of the market. By combining image processing with machine learning, we identify fish infected by various pathogens. The work is divided into two parts. In the first part, image pre-processing and segmentation are applied to reduce noise and enhance the image, respectively. In the second part, we extract the relevant features and classify the diseases using a Support Vector Machine (SVM) with a kernel function. The processed images from the first part are passed through this SVM model. Then we conduct a comprehensive experiment with the proposed combination ...
HamNet: Conformation-Guided Molecular Representation with Hamiltonian Neural Networks.
EN: Well-designed molecular representations (fingerprints) are vital to combine medical chemistry and deep learning. Whereas incorporating 3D geometry of molecules (i.e. conformations) in their representations seems beneficial, current 3D algorithms are still in their infancy. In this paper, we propose a novel molecular representation algorithm which preserves 3D conformations of molecules with a Molecular Hamiltonian Network (HamNet). In HamNet, implicit positions and momenta of atoms in a molecule interact in the Hamiltonian Engine following the discretized Hamiltonian equations. These implicit coordinates are supervised with real conformations using translation- and rotation-invariant losses, and are further used as inputs to the Fingerprint Generator, a message-passing neural network. Experiments show that the Hamiltonian Engine preserves molecular conformations well, and that the fingerprints generated by HamNet achieve state-of-the-art performance on MoleculeNet, a standard molecular machine learning benchmark.
MEGADOCK-GUI: a GUI-based complete cross-docking tool for exploring protein-protein interactions.
EN: Information on protein-protein interactions (PPIs) not only advances our understanding of molecular biology but also provides important clues for target selection in drug discovery and the design of PPI inhibitors. One of the techniques used for computational prediction of PPIs is protein-protein docking calculations, and a variety of software has been developed. However, a friendly interface for users who are not sufficiently familiar with the command line interface has not been developed so far. In this study, we have developed a graphical user interface, MEGADOCK-GUI, which enables users to easily predict PPIs and protein complex structures. In addition to the original 3-D molecular viewer and input file preparation functions, MEGADOCK-GUI is software that can automatically perform complete cross-docking of $M$ vs. $N$ proteins. With MEGADOCK-GUI, various applications related to the prediction of PPIs, such as ensemble docking that handles multiple conformations of proteins and screening of binding partner proteins that bind to specific proteins, can now be easily performed.
Long-range atom-ion Rydberg molecule: A novel molecular binding mechanism.
EN: We present a novel binding mechanism where a neutral Rydberg atom and an atomic ion form a molecular bound state at large internuclear distance. The binding mechanism is based on Stark shifts and level crossings which are induced in the Rydberg atom due to the electric field of the ion. At particular internuclear distances between Rydberg atom and ion, potential wells occur which can hold atom-ion molecular bound states. Apart from the binding mechanism, we describe important properties of the long-range atom-ion Rydberg molecule, such as its lifetime and decay paths, its vibrational and rotational structure, and its large dipole moment. Furthermore, we discuss methods to produce and detect it. The unusual properties of the long-range atom-ion Rydberg molecule give rise to interesting prospects for studies of wave packet dynamics in engineered potential energy landscapes.
ResAtom System: Protein and Ligand Affinity Prediction Model Based on Deep Learning.
EN: Motivation: Protein-ligand affinity prediction is an important part of structure-based drug design. It includes molecular docking and affinity prediction. Although molecular dynamics can predict affinity with high accuracy at present, it is not suitable for large-scale virtual screening. The existing affinity prediction and evaluation functions based on deep learning mostly rely on experimentally-determined conformations. Results: We build a predictive model of protein-ligand affinity through the ResNet neural network with added attention mechanism. The resulting ResAtom-Score model achieves Pearson's correlation coefficient R = 0.833 on the CASF-2016 benchmark test set. At the same time, we evaluated the performance of a variety of existing scoring functions in combination with ResAtom-Score in the absence of experimentally-determined conformations. The results show that the use of ΔVinaRF20 in combination with ResAtom-Score can achieve affinity prediction close to scoring functions in the presence of experimentally-determined conformations. These results suggest that ResAtom system may be used for in silico screening of small molecule ligands with target proteins in the future. A...
AMMU: A Survey of Transformer-based Biomedical Pretrained Language Models.
EN: Transformer-based pretrained language models (PLMs) have started a new era in modern natural language processing (NLP). These models combine the power of transformers, transfer learning, and self-supervised learning (SSL). Following the success of these models in the general domain, the biomedical research community has developed various in-domain PLMs starting from BioBERT to the latest BioELECTRA and BioALBERT models. We strongly believe there is a need for a survey paper that can provide a comprehensive survey of various transformer-based biomedical pretrained language models (BPLMs). In this survey, we start with a brief overview of foundational concepts like self-supervised learning, embedding layer and transformer encoder layers. We discuss core concepts of transformer-based PLMs like pretraining methods, pretraining tasks, fine-tuning methods, and various embedding types specific to biomedical domain. We introduce a taxonomy for transformer-based BPLMs and then discuss all the models. We discuss various challenges and present possible solutions. We conclude by highlighting some of the open issues which will drive the research community to further improve transformer-based BP...
Accurate Prediction of Free Solvation Energy of Organic Molecules via Graph Attention Network and Message Passing Neural Network from Pairwise Atomistic Interactions.
EN: Deep learning based methods have been widely applied to predict various kinds of molecular properties in the pharmaceutical industry with increasingly more success. Solvation free energy is an important index in the field of organic synthesis, medicinal chemistry, drug delivery, and biological processes. However, accurate solvation free energy determination is a time-consuming experimental process. Furthermore, it could be useful to assess solvation free energy in the absence of a physical sample. In this study, we propose two novel models for the problem of free solvation energy predictions, based on the Graph Neural Network (GNN) architectures: Message Passing Neural Network (MPNN) and Graph Attention Network (GAT). GNNs are capable of summarizing the predictive information of a molecule as low-dimensional features directly from its graph structure without relying on an extensive amount of intra-molecular descriptors. As a result, these models are capable of making accurate predictions of the molecular properties without the time consuming process of running an experiment on each molecule. We show that our proposed models outperform all quantum mechanical and molecular dynamics m...
Classifying herbal medicine origins by temporal and spectral data mining of electronic nose.
EN: The origins of herbal medicines are important for their treatment effect and could potentially be distinguished by an electronic nose system. As the differences between odor fingerprints of herbal medicines from different origins can be tiny, discriminating origins can be much harder than discriminating categories. Better feature extraction methods are important for performing this task more accurately, but systematic studies of different feature extraction methods are lacking. In this study, we classified different origins of three categories of herbal medicines with different feature extraction methods: manual feature extraction, mathematical transformation, and deep learning algorithms. With 50 repeated experiments with bootstrapping, we compared the effectiveness of the extractions using a two-layer neural network with or without dimensionality reduction methods (principal component analysis, linear discriminant analysis) as the three base classifiers. Compared with the conventional aggregated features, the Fast Fourier Transform method and our novel approach (longitudinal-information-in-a-line) showed a significant accuracy improvement (p < 0.05) on all three base classifiers and all three herbal medicin...
Heated gas bubbles enrich, crystallize, dry, phosphorylate and encapsulate prebiotic molecules.
EN: Non-equilibrium conditions must have been crucial for the assembly of the first informational polymers of early life, by supporting their formation and continuous enrichment in a long-lasting environment. Here, we explore how gas bubbles in water subjected to a thermal gradient, a likely scenario within crustal mafic rocks on the early Earth, drive a complex, continuous enrichment of prebiotic molecules. RNA precursors, monomers, active ribozymes, oligonucleotides and lipids are shown to (1) cycle between dry and wet states, enabling the central step of RNA phosphorylation, (2) accumulate at the gas-water interface to drastically increase ribozymatic activity, (3) condense into hydrogels, (4) form pure crystals and (5) encapsulate into protecting vesicle aggregates that subsequently undergo fission. These effects occur within less than 30 min. The findings unite, in one location, the physical conditions that were crucial for the chemical emergence of biopolymers. They suggest that heated microbubbles could have hosted the first cycles of molecular evolution.
Shear dynamics of polydisperse double emulsions.
EN: We numerically study the dynamics of a polydisperse double emulsion under a symmetric shear flow. We show that both dispersity and shear rate crucially affect the behavior of the innermost drops and of the surrounding shell. While at low/moderate values of shear rates the inner drops rotate periodically around a common center of mass triggered by the fluid vortex formed within the emulsion generally regardless of their polydispersity, at higher values such dynamics occurs only at increasing polydispersity, since monodisperse drops are found to align along the shear flow and become approximately motionless at late times. Our simulations also suggest that increasing polydispersity favours close-range contacts among cores and persistent collisions, while hindering shape deformations of the external droplet. A quantitative evaluation of these effects is also provided.
Using Molecular Embeddings in QSAR Modeling: Does it Make a Difference?
EN: With the consolidation of deep learning in drug discovery, several novel algorithms for learning molecular representations have been proposed. Despite the interest of the community in developing new methods for learning molecular embeddings and their theoretical benefits, comparing molecular embeddings with each other and with traditional representations is not straightforward, which in turn hinders the process of choosing a suitable representation for QSAR modeling. A reason behind this issue is the difficulty of conducting a fair and thorough comparison of the different existing embedding approaches, which requires numerous experiments on various datasets and training scenarios. To close this gap, we reviewed the literature on methods for molecular embeddings and reproduced three unsupervised and two supervised molecular embedding techniques recently proposed in the literature. We compared these five methods concerning their performance in QSAR scenarios using different classification and regression datasets. We also compared these representations to traditional molecular representations, namely molecular descriptors and fingerprints. As opposed to the expected outcome, our exper...
WheatNet: A Lightweight Convolutional Neural Network for High-throughput Image-based Wheat Head Detection and Counting.
EN: For a globally recognized plant breeding organization, manually-recorded field observation data is crucial for plant breeding decision making. However, certain phenotypic traits such as plant color, height, kernel counts, etc. can only be collected during a specific time-window of a crop's growth cycle. Due to labor-intensive requirements, only a small subset of possible field observations are recorded each season. To help mitigate this data collection bottleneck in wheat breeding, we propose a novel deep learning framework to accurately and efficiently count wheat heads to aid in the gathering of real-time data for decision making. We call our model WheatNet and show that our approach is robust and accurate for a wide range of environmental conditions of the wheat field. WheatNet uses a truncated MobileNetV2 as a lightweight backbone feature extractor which merges feature maps with different scales to counter image scale variations. Then, extracted multi-scale features go to two parallel sub-networks for simultaneous density-based counting and localization tasks. Our proposed method achieves an MAE and RMSE of 3.85 and 5.19 in our wheat head counting task, respectively, while h...
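For readers unfamiliar with the two counting metrics reported above, the sketch below shows how MAE and RMSE would be computed over per-image head counts. The counts are hypothetical and are not data from the paper.

```python
import math

def mae(true_counts, predicted_counts):
    """Mean absolute error between true and predicted counts."""
    errors = [abs(t - p) for t, p in zip(true_counts, predicted_counts)]
    return sum(errors) / len(errors)

def rmse(true_counts, predicted_counts):
    """Root mean squared error; penalizes large miscounts more than MAE does."""
    squared = [(t - p) ** 2 for t, p in zip(true_counts, predicted_counts)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical wheat head counts for four images.
true_counts = [10, 20, 30, 40]
predicted_counts = [12, 18, 33, 40]

count_mae = mae(true_counts, predicted_counts)
count_rmse = rmse(true_counts, predicted_counts)
```

RMSE is never smaller than MAE on the same data, so a gap between the two (as in 3.85 versus 5.19) indicates that some images have noticeably larger errors than the rest.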
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation.
EN: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
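The idea of operating on characters without a vocabulary, then shortening the sequence, can be sketched as below. This is only an illustration of the input side: grouping positions into fixed windows stands in for the learned downsampling described in the abstract, and the rate of 4 is an assumed value for the example.

```python
def to_codepoints(text):
    """Map each character to its Unicode code point; no vocabulary is needed."""
    return [ord(ch) for ch in text]

def downsample(codepoints, rate=4):
    """Group consecutive positions into windows of `rate` characters."""
    return [codepoints[i:i + rate] for i in range(0, len(codepoints), rate)]

text = "caf\u00e9 au lait"          # works for any script, no tokenizer involved
ids = to_codepoints(text)
windows = downsample(ids, rate=4)   # hypothetical rate, chosen for illustration
```

Here 12 characters become 3 positions, which is the kind of length reduction that makes a deep transformer stack affordable on character input.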
Lightweight Combinational Machine Learning Algorithm for Sorting Canine Torso Radiographs.
EN: The veterinary field lacks automation in contrast to the tremendous technological advances made in the human medical field. Implementation of machine learning technology can shorten any step of the automation process. This paper explores these core concepts and starts with automation in sorting radiographs for canines by view and anatomy. This is achieved by developing a new lightweight algorithm inspired by AlexNet, Inception, and SqueezeNet. The proposed module proves to be lighter than SqueezeNet while maintaining accuracy higher than that of AlexNet, ResNet, DenseNet, and SqueezeNet.
Graph Energy-based Model for Substructure Preserving Molecular Design.
EN: It is common practice for chemists to search chemical databases based on substructures of compounds to find molecules with desired properties. The purpose of de novo molecular generation is to generate instead of search. Existing machine learning based molecular design methods have little or no ability to generate novel molecules that preserve a target substructure. Our Graph Energy-based Model, or GEM, can fix substructures and generate the rest. The experimental results show that the GEMs trained from chemistry datasets successfully generate novel molecules while preserving the target substructures. This method would provide a new way of incorporating the domain knowledge of chemists in molecular design.
Boost AI Power: Data Augmentation Strategies with unlabelled Data and Conformal Prediction, a Case in Alternative Herbal Medicine Discrimination with Electronic Nose.
EN: The electronic nose has proven effective in alternative herbal medicine classification, but because of the nature of supervised learning, previous research relies heavily on labelled training data, which are time-consuming and labor-intensive to collect. To alleviate this critical dependency on training data in real-world applications, this study aims to improve classification accuracy via data augmentation strategies. The effectiveness of five data augmentation strategies under different degrees of training data inadequacy is investigated in two scenarios: the noise-free scenario, where different availabilities of unlabelled data were considered, and the noisy scenario, where different levels of Gaussian noise and translational shifts were added to represent sensor drift. The five augmentation strategies, namely noise-adding data augmentation, semi-supervised learning, classifier-based online learning, Inductive Conformal Prediction (ICP) online learning and our novel ensemble ICP online learning proposed in this study, are tested and compared against a supervised learning baseline, with Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM) as the classifiers. Our n...
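To make the Inductive Conformal Prediction step named in this abstract concrete, here is a minimal sketch of the standard ICP p-value. The abstract does not say how the authors compute nonconformity scores or how they accept unlabelled samples during online learning, so the function names, the acceptance rule, and the `significance` threshold are illustrative assumptions only.

```python
def icp_p_value(calibration_scores, test_score):
    """Standard ICP p-value: share of calibration nonconformity scores
    that are at least as large as the test score (with +1 smoothing)."""
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def accept_pseudo_label(p_values, significance=0.1):
    """Hypothetical acceptance rule (an assumption, not the paper's):
    keep an unlabelled sample only if exactly one candidate label
    has a p-value above the significance level."""
    kept = [label for label, p in p_values.items() if p > significance]
    return kept[0] if len(kept) == 1 else None


# Four calibration scores; two of them are >= 0.5, so p = (2 + 1) / (4 + 1).
p = icp_p_value([0.1, 0.4, 0.7, 0.9], 0.5)
```

A sample that receives a single confident label could then be added to the training set, which is one plausible reading of "ICP online learning".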
Dietary Supplements and Nutraceuticals Under Investigation for COVID-19 Prevention and Treatment.
EN: Coronavirus disease 2019 (COVID-19) has caused global disruption and a significant loss of life. Existing treatments that can be repurposed as prophylactic and therapeutic agents could reduce the pandemic's devastation. Emerging evidence of potential applications in other therapeutic contexts has led to the investigation of dietary supplements and nutraceuticals for COVID-19. Such products include vitamin C, vitamin D, omega 3 polyunsaturated fatty acids, probiotics, and zinc, all of which are currently under clinical investigation. In this review, we critically appraise the evidence surrounding dietary supplements and nutraceuticals for the prophylaxis and treatment of COVID-19. Overall, further study is required before evidence-based recommendations can be formulated, but nutritional status plays a significant role in patient outcomes, and these products could help alleviate deficiencies. For example, evidence indicates that vitamin D deficiency may be associated with greater incidence of infection and severity of COVID-19, suggesting that vitamin D supplementation may hold prophylactic or therapeutic value. A growing number of scientific organizations are now considering recomme...
Heterogeneous Graph based Deep Learning for Biomedical Network Link Prediction.
EN: Multi-scale biomedical knowledge networks are expanding with emerging experimental technologies that generate multi-scale biomedical big data. Link prediction is increasingly used, especially in bipartite biomedical networks, to identify hidden biological interactions and relationships between key entities such as compounds, targets, genes and diseases. We propose a Graph Neural Network (GNN) method, namely the Graph Pair based Link Prediction model (GPLP), for predicting biomedical network links based simply on their topological interaction information. In GPLP, 1-hop subgraphs extracted from the known network interaction matrix are learnt to predict missing links. To evaluate our method, three heterogeneous biomedical networks were used, i.e. Drug-Target Interaction network (DTI), Compound-Protein Interaction network (CPI) from NIH Tox21, and Compound-Virus Inhibition network (CVI). Our proposed GPLP method significantly outperforms the state-of-the-art baselines. In addition, different levels of network incompleteness are analysed with our devised protocol, and we also design an effective approach to improve the model robustness towards incomplete networks. Our method demonstrates the potentia...
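The idea of extracting a 1-hop subgraph around a candidate link from a bipartite interaction matrix can be sketched as follows. This is an illustration only: the abstract does not describe GPLP's exact extraction procedure, and the function name, the matrix layout (rows as compounds, columns as targets), and the choice to hide the candidate link itself are assumptions.

```python
def one_hop_subgraph(matrix, row, col):
    """Collect the 1-hop neighbourhood of a candidate (row, col) pair.

    matrix[r][c] is 1 when row entity r is known to interact with
    column entity c. The candidate link itself is excluded so that
    it cannot leak into the features used to predict it.
    """
    # Column entities that interact with the candidate row entity.
    row_neighbors = [c for c, v in enumerate(matrix[row]) if v and c != col]
    # Row entities that interact with the candidate column entity.
    col_neighbors = [r for r, line in enumerate(matrix) if line[col] and r != row]
    rows = sorted({row, *col_neighbors})
    cols = sorted({col, *row_neighbors})
    edges = [(r, c) for r in rows for c in cols
             if matrix[r][c] and (r, c) != (row, col)]
    return rows, cols, edges


interactions = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
rows, cols, edges = one_hop_subgraph(interactions, 0, 1)
```

The resulting node and edge lists would then be handed to a graph neural network that scores the candidate pair.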
Analysis of skin lesion images with deep learning.
EN: Skin cancer is the most common cancer worldwide, with melanoma being the deadliest form. Dermoscopy is a skin imaging modality that has shown an improvement in the diagnosis of skin cancer compared to visual examination without support. We evaluate the current state of the art in the classification of dermoscopic images based on the ISIC-2019 Challenge for the classification of skin lesions and current literature. Various deep neural network architectures pre-trained on the ImageNet data set are adapted to a combined training data set composed of publicly available dermoscopic and clinical images of skin lesions using transfer learning and model fine-tuning. The performance and applicability of these models for the detection of eight classes of skin lesions are examined. Real-time data augmentation, which uses random rotation, translation, shear, and zoom within specified bounds, is used to increase the number of available training samples. Model predictions are multiplied by inverse class frequencies and normalized to better approximate actual probability distributions. Overall prediction accuracy is further increased by using the arithmetic mean of the predictions of several inde...
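The two post-processing steps mentioned in this abstract, reweighting predictions by inverse class frequency and averaging several models, can be sketched in a few lines. The function names and the toy numbers are assumptions made for illustration; the abstract does not give the authors' implementation.

```python
def reweight_by_inverse_frequency(probs, class_freqs):
    """Multiply each class probability by 1 / class frequency,
    then renormalize so the result sums to 1."""
    scaled = [p / f for p, f in zip(probs, class_freqs)]
    total = sum(scaled)
    return [s / total for s in scaled]


def ensemble_mean(predictions):
    """Arithmetic mean of per-class predictions from several models."""
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]


# A model trained on 75% / 25% class balance predicts [0.6, 0.4].
adjusted = reweight_by_inverse_frequency([0.6, 0.4], [0.75, 0.25])
# Average the outputs of two models for one image.
averaged = ensemble_mean([[0.2, 0.8], [0.4, 0.6]])
```

In the toy case the rare class overtakes the frequent one after reweighting, which is the intended effect on an imbalanced data set.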
Rayleigh-Bénard convection of a model emulsion: anomalous heat-flux fluctuations and finite-size droplets effects.
EN: We present mesoscale numerical simulations of Rayleigh-Bénard (RB) convection in a two-dimensional model emulsion. The systems under study are constituted of finite-size droplets, whose concentration Phi_0 is systematically varied from small (Newtonian emulsions) to large values (non-Newtonian emulsions). We focus on the characterisation of the heat transfer properties close to the transition from conductive to convective states, where it is known that a homogeneous Newtonian system exhibits a steady flow and a time-independent heat flux. In marked contrast, emulsions exhibit a non-steady dynamics with fluctuations in the heat flux. In this paper, we aim at the characterisation of such non-steady dynamics via detailed studies on the time-averaged heat flux and its fluctuations. To understand the time-averaged heat flux, we propose a side-by-side comparison between the emulsion system and a single-phase (SP) system, whose viscosity is constructed from the shear rheology of the emulsion. We show that such local closure works well only when a suitable degree of coarse-graining (at the droplet scale) is introduced in the local viscosity. To delve deeper into the fluctuations in the hea...
Extracting Pasture Phenotype and Biomass Percentages using Weakly Supervised Multi-target Deep Learning on a Small Dataset.
EN: The dairy industry uses clover and grass as fodder for cows. Accurate estimation of grass and clover biomass yield enables smart decisions in optimizing fertilization and seeding density, resulting in increased productivity and positive environmental impact. Grass and clover are usually planted together, since clover is a nitrogen-fixing plant that brings nutrients to the soil. Adjusting the right percentages of clover and grass in a field reduces the need for external fertilization. Existing approaches for estimating the grass-clover composition of a field are expensive and time consuming - random samples of the pasture are clipped and then the components are physically separated to weigh and calculate percentages of dry grass, clover and weeds in each sample. There is growing interest in developing novel deep learning based approaches to non-destructively extract pasture phenotype indicators and biomass yield predictions of different plant species from agricultural imagery collected from the field. Providing these indicators and predictions from images alone remains a significant challenge. Heavy occlusions in the dense mixture of grass, clover and weeds make it difficult to esti...
Adhesion as a trigger of droplet polarization in flowing emulsions.
EN: Tissues are subjected to large external forces and undergo global deformations during morphogenesis. We use synthetic analogues of tissues to study the impact of cell-cell adhesion on the response of cohesive cellular assemblies under such stresses. In particular, we use biomimetic emulsions in which the droplets are functionalized in order to exhibit specific droplet-droplet adhesion. We flow these emulsions in microfluidic constrictions and study their response to this forced deformation via confocal microscopy. We find that the distributions of avalanche sizes are conserved between repulsive and adhesive droplets. However, adhesion locally impairs the rupture of droplet-droplet contacts, which in turn pulls on the rearranging droplets. As a result, adhesive droplets are a lot more deformed along the axis of elongation in the constriction. This finding could shed light on the origin of polarization processes during morphogenesis.
Predicting Illness for a Sustainable Dairy Agriculture: Predicting and Explaining the Onset of Mastitis in Dairy Cows.
EN: Mastitis is a billion-dollar health problem for the modern dairy industry, with implications for antibiotic resistance. The use of AI techniques to identify the early onset of this disease thus has significant implications for the sustainability of this agricultural sector. Current approaches to treating mastitis involve antibiotics, and this practice is coming under ever-increasing scrutiny. Using machine learning models to identify cows at risk of developing mastitis and applying targeted treatment regimes to only those animals promotes a more sustainable approach. Incorrect predictions from such models, however, can lead to monetary losses, unnecessary use of antibiotics, and even the premature death of animals, so it is important to generate compelling explanations for predictions to build trust with users and to better support their decision making. In this paper we demonstrate a system developed to predict mastitis infections in cows and provide explanations of these predictions using counterfactuals. We demonstrate the system and describe the engagement with farmers undertaken to build it.
An Experimental Evaluation of Transformer-based Language Models in the Biomedical Domain.
EN: With the growing amount of text in health data, there have been rapid advances in large pre-trained models that can be applied to a wide variety of biomedical tasks with minimal task-specific modifications. Emphasizing the cost of these models, which renders technical replication challenging, this paper summarizes experiments conducted in replicating BioBERT and further pre-training and careful fine-tuning in the biomedical domain. We also investigate the effectiveness of domain-specific and domain-agnostic pre-trained models across downstream biomedical NLP tasks. Our finding confirms that pre-trained models can be impactful in some downstream NLP tasks (QA and NER) in the biomedical domain; however, this improvement may not justify the high cost of domain-specific pre-training.
Learn molecular representations from large-scale unlabeled molecules for drug discovery.
EN: How to produce expressive molecular representations is a fundamental challenge in AI-driven drug discovery. Graph neural networks (GNNs) have emerged as a powerful technique for modeling molecular data. However, previous supervised approaches usually suffer from the scarcity of labeled data and have poor generalization capability. Here, we propose a novel Molecular Pre-training Graph-based deep learning framework, named MPG, that learns molecular representations from large-scale unlabeled molecules. In MPG, we propose a powerful MolGNet model and an effective self-supervised strategy for pre-training the model at both the node and graph level. After pre-training on 11 million unlabeled molecules, we show that MolGNet can capture valuable chemistry insights to produce interpretable representations. The pre-trained MolGNet can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of drug discovery tasks, including molecular property prediction, drug-drug interaction, and drug-target interaction, involving 13 benchmark datasets. Our work demonstrates that MPG is a promising novel approach for the drug discovery pipeline.
A Comparative Analysis of the Ensemble Methods for Drug Design.
EN: Quantitative structure-activity relationship (QSAR) is a computer modeling technique for identifying relationships between the structural properties of chemical compounds and biological activity. QSAR modeling is necessary for drug discovery, but it has many limitations. Ensemble-based machine learning approaches have been used to overcome limitations and generate reliable predictions. Ensemble learning creates a set of diverse models and combines them. In our comparative analysis, each ensemble algorithm was paired with each of the basic algorithms, but the basic algorithms were also investigated separately. In this configuration, 57 algorithms were developed and compared on 4 different datasets. Thus, a complex ensemble method is proposed that builds diversified models and integrates them. The proposed individual models did not show impressive results as a unified model, but it was considered the most important predictor when combined. We assessed whether ensembles always give better results than individual algorithms. The Python code written to obtain the experimental results in this article has been uploaded to GitHub (https://github.com/rifqat/Comparative-Analysis).
A Review of Hidden Markov Models and Recurrent Neural Networks for Event Detection and Localization in Biomedical Signals.
EN: Biomedical signals carry signature rhythms of complex physiological processes that control our daily bodily activity. The properties of these rhythms indicate the nature of interaction dynamics among physiological processes that maintain homeostasis. Abnormalities associated with diseases or disorders usually appear as disruptions in the structure of the rhythms, which makes isolating these rhythms and differentiating between them indispensable. Computer-aided diagnosis systems are ubiquitous nowadays in almost every medical facility and more closely in wearable technology, and rhythm or event detection is the first of many intelligent steps that they perform. How are these rhythms isolated? How can one develop a model that describes the transition between processes in time? Many methods exist in the literature that address these questions and perform the decoding of biomedical signals into separate rhythms. Here, we demystify the most effective methods that are used for detection and isolation of rhythms or events in time series and highlight the way in which they were applied to different biomedical signals and how they contribute to information fusion. The key st...
Utilising Graph Machine Learning within Drug Discovery and Development.
EN: Graph Machine Learning (GML) is receiving growing interest within the pharmaceutical and biotechnology industries for its ability to model biomolecular structures, the functional relationships between them, and integrate multi-omic datasets - amongst other data types. Herein, we present a multidisciplinary academic-industrial review of the topic within the context of drug discovery and development. After introducing key terms and modelling approaches, we move chronologically through the drug development pipeline to identify and summarise work incorporating: target identification, design of small molecules and biologics, and drug repurposing. Whilst the field is still emerging, key milestones including repurposed drugs entering in vivo studies, suggest graph machine learning will become a modelling framework of choice within biomedical machine learning.
An autoencoder wavelet based deep neural network with attention mechanism for multistep prediction of plant growth.
EN: Multi-step prediction is considered of major significance for time series analysis in many real-life problems. Existing methods mainly focus on one-step-ahead forecasting, since multi-step forecasting generally fails due to accumulation of prediction errors. This paper presents a novel approach for predicting plant growth in agriculture, focusing on prediction of plant Stem Diameter Variations (SDV). The proposed approach consists of three main steps. First, wavelet decomposition is applied to the original data to facilitate model fitting and reduce noise. Then an encoder-decoder framework is developed using Long Short Term Memory (LSTM) and used for appropriate feature extraction from the data. Finally, a recurrent neural network including LSTM and an attention mechanism is proposed for modelling long-term dependencies in the time series data. Experimental results illustrate the good performance of the proposed approach and show that it significantly outperforms the existing models in terms of error criteria such as RMSE, MAE and MAPE.
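For readers unfamiliar with the three error criteria named above, a minimal sketch of their usual definitions follows. The function names and sample values are illustrative assumptions; the abstract does not state which variant of MAPE the authors use, so the common percentage form is shown.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when a true value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


observed = [2.0, 4.0]
forecast = [1.0, 5.0]
errors = (rmse(observed, forecast), mae(observed, forecast), mape(observed, forecast))
```

RMSE penalizes large errors more than MAE does, while MAPE expresses the error relative to the observed value, which is why papers often report all three together.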
Concentrated phase emulsion with multi-core morphology under shear: A numerical study.
EN: We numerically study the dynamic behavior under a symmetric shear flow of selected examples of concentrated phase emulsions with multi-core morphology confined within a microfluidic channel. A variety of new nonequilibrium steady states is reported. Under low shear rates, the emulsion is found to exhibit a solid-like behavior, in which cores display a periodic planetary-like motion with approximately equal angular velocity. At higher shear rates two steady states emerge, one in which all inner cores align along the flow and become essentially motionless and a further one in which some cores accumulate near the outer interface and produce a dynamical elliptical-shaped ring chain, reminiscent of a treadmilling-like structure, while others occupy the center of the emulsion. A quantitative description in terms of i) motion of the cores, ii) rate of deformation of the emulsion and iii) structure of the fluid flow within the channel is also provided.
Localization of Malaria Parasites and White Blood Cells in Thick Blood Smears.
EN: Effectively determining malaria parasitemia is critical in helping clinicians accurately determine the severity of the disease and provide quality treatment. Microscopy applied to thick blood smears is the de facto method for malaria parasitemia determination. However, manual quantification of parasitemia is time consuming, laborious and requires considerable trained expertise, which is particularly scarce in highly endemic and low-resourced areas. This study presents an end-to-end approach for localisation and counting of malaria parasites and white blood cells (WBCs), which aids in the effective determination of parasitemia, the quantitative content of parasites in the blood. On a dataset of slices of images of thick blood smears, we build models to analyse the obtained digital images. To improve model performance given the limited size of the dataset, data augmentation was applied. Our preliminary results show that our deep learning approach reliably detects and returns a count of malaria parasites and WBCs with high precision and recall. We also evaluate our system against human experts and results indicate a strong correlation between our deep learning mod...
Cross-Modal Retrieval and Synthesis (X-MRS): Closing the Modality Gap in Shared Representation Learning.
EN: Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state of the art (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce ...
THCluster: herb supplements categorization for precision traditional Chinese medicine.
EN: There has been a continuing demand for traditional and complementary medicine worldwide. A fundamental and important topic in Traditional Chinese Medicine (TCM) is to optimize the prescription and to detect herb regularities from TCM data. In this paper, we propose a novel clustering model to solve this general problem of herb categorization, a pivotal task of prescription optimization and herb regularities. The model uses the random walk method, Bayesian rules and Expectation Maximization (EM) models to complete a clustering analysis effectively on a heterogeneous information network. We performed extensive experiments on real-world datasets and compared our method with other algorithms and experts. Experimental results have demonstrated the effectiveness of the proposed model for discovering useful categorization of herbs and its potential clinical manifestations.
Skin disease diagnosis with deep learning: a review.
EN: Skin cancer is one of the most threatening diseases worldwide. However, diagnosing skin cancer correctly is challenging. Recently, deep learning algorithms have emerged to achieve excellent performance on various tasks. Particularly, they have been applied to skin disease diagnosis tasks. In this paper, we present a review of deep learning methods and their applications in skin disease diagnosis. We first present a brief introduction to skin diseases and image acquisition methods in dermatology, and list several publicly available skin datasets for training and testing algorithms. Then, we introduce the concept of deep learning and review popular deep learning architectures. Thereafter, popular deep learning frameworks facilitating the implementation of deep learning algorithms and performance evaluation metrics are presented. As an important part of this article, we then review the literature involving deep learning methods for skin disease diagnosis from several aspects according to the specific tasks. Additionally, we discuss the challenges faced in the area and suggest possible future research directions. The major purpose of this article is to provide a conceptual and s...
Biomedical Information Extraction for Disease Gene Prioritization.
EN: We introduce a biomedical information extraction (IE) pipeline that extracts biological relationships from text and demonstrate that its components, such as named entity recognition (NER) and relation extraction (RE), outperform the state of the art in BioNLP. We apply it to tens of millions of PubMed abstracts to extract protein-protein interactions (PPIs) and add these extractions to a biomedical knowledge graph that already contains PPIs extracted from STRING, the leading structured PPI database. We show that, despite the graph already containing PPIs from an established structured source, augmenting it with our own IE-based extractions allows us to predict novel disease-gene associations with a 20% relative increase in hit@30, an important step towards developing drug targets for uncured diseases.
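The hit@30 metric and the notion of a relative increase can be illustrated with a short sketch. The abstract does not define hit@k precisely, so the definition below (fraction of queries with at least one true item in the top k) is a common convention assumed here; the function names, gene and disease identifiers, and scores are all hypothetical.

```python
def hits_at_k(ranked, truth, k):
    """Fraction of queries whose top-k ranked items contain a true item.

    ranked maps a query (e.g. a disease) to candidates ordered best first;
    truth maps the same query to the set of correct candidates.
    """
    hits = sum(1 for q, items in ranked.items() if truth[q] & set(items[:k]))
    return hits / len(ranked)


def relative_increase(baseline, improved):
    """Relative change of a metric, e.g. 0.25 -> 0.30 is a 20% increase."""
    return (improved - baseline) / baseline


ranked = {"d1": ["g3", "g1", "g2"], "d2": ["g5", "g4"]}
truth = {"d1": {"g1"}, "d2": {"g9"}}
score = hits_at_k(ranked, truth, k=2)
gain = relative_increase(0.25, 0.30)
```

Note that a 20% relative increase is not the same as 20 percentage points: the toy example moves from 0.25 to 0.30, an absolute gain of 0.05.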
Design Of Drug-Like Protein-Protein Interaction Stabilizers Guided By Chelation-Controlled Bioactive Conformation Stabilization.
EN: The protein-protein interactions (PPIs) of 14-3-3 proteins are a model system for studying PPI stabilization. The complex natural product Fusicoccin A stabilizes many 14-3-3 PPIs but is not amenable to use in SAR studies, motivating the search for more drug-like chemical matter. However, drug-like 14-3-3 PPI stabilizers enabling such study have remained elusive. An X-ray crystal structure of a PPI in complex with an extremely low-potency stabilizer uncovered an unexpected non-protein-interacting, ligand-chelated Mg2+ ion, leading to the discovery of metal ion-dependent 14-3-3 PPI stabilization potency. This originates from a novel chelation-controlled bioactive conformation stabilization effect. Metal chelation has been associated with pan-assay interference compounds (PAINS) and frequent hitter behavior, but chelation can evidently also lead to true potency gains and find use as a medicinal chemistry strategy to guide compound optimization. To demonstrate this, we exploited the effect to design the first potent, selective and drug-like 14-3-3 PPI stabilizers.
Analyzing the Effect of Multi-task Learning for Biomedical Named Entity Recognition.
EN: Developing high-performing systems for detecting biomedical named entities has major implications. State-of-the-art deep-learning-based solutions for entity recognition often require large annotated datasets, which are not available in the biomedical domain. Transfer learning and multi-task learning have been shown to improve performance for low-resource domains. However, applications of these methods are relatively scarce in the biomedical domain, and a theoretical understanding of why these methods improve performance is lacking. In this study, we performed an extensive analysis to understand the transferability between different biomedical entity datasets. We found useful measures to predict transferability between these datasets. In addition, we propose combining transfer learning and multi-task learning to improve the performance of biomedical named entity recognition systems, which, to the best of our knowledge, has not been done before.
How additive manufacturing can boost the bioactivity of baked functional foods.
EN: The antioxidant activity of baked foods is of utmost interest when seeking to enhance their health benefits. Incorporating functional ingredients is challenging since their bioactivity naturally declines during baking. In this study, 3D food printing and design of experiments are employed to clarify how the antioxidant activity of cookies enriched with encapsulated polyphenols can be maximized. A synergistic effect between encapsulation, time, temperature, number of layers, and infill of the printed cookies was observed on the moisture and antioxidant activity. Four-layer cookies with 30 % infill provided the highest bioactivity and phenolic content when baked for 10 min at 180 °C. The bioactivity and total phenolic content improved by 115 % and 173 %, respectively, compared to free extract cookies. Moreover, the proper combination of the design and baking variables made it possible to vary the bioactivity of cooked cookies (moisture 3-5 %) between 300 and 700 μmolTR/gdry. The additive manufacture of foods with interconnected pores could accelerate baking and browning, or reduce thermal degradation. This represents a potential approach to enhance the functional and healthy properties ...
Investigating 3D Atomic Environments for Enhanced QSAR.
EN: Predicting bioactivity and physical properties of molecules is a longstanding challenge in drug design. Most approaches use molecular descriptors based on a 2D representation of molecules as a graph of atoms and bonds, abstracting away the molecular shape. A difficulty in accounting for 3D shape is in designing molecular descriptors that can precisely capture molecular shape while remaining invariant to rotations/translations. We describe a novel alignment-free 3D QSAR method using Smooth Overlap of Atomic Positions (SOAP), a well-established formalism developed for interpolating potential energy surfaces. We show that this approach rigorously describes local 3D atomic environments to compare molecular shapes in a principled manner. This method performs competitively with traditional fingerprint-based approaches as well as state-of-the-art graph neural networks on pIC$_{50}$ ligand-binding prediction in both random and scaffold split scenarios. We illustrate the utility of SOAP descriptors by showing that their inclusion in ensembling diverse representations statistically improves performance, demonstrating that incorporating 3D atomic environments could lead to enhanced QSAR for cheminfo...
Self-Alignment Pretraining for Biomedical Entity Representations.
EN: Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking, where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.
Predicting Biomedical Interactions with Higher-Order Graph Convolutional Networks.
EN: Biomedical interaction networks have incredible potential to be useful in the prediction of biologically meaningful interactions, identification of network biomarkers of disease, and the discovery of putative drug targets. Recently, graph neural networks have been proposed to effectively learn representations for biomedical entities and achieved state-of-the-art results in biomedical interaction prediction. These methods only consider information from immediate neighbors but cannot learn a general mixing of features from neighbors at various distances. In this paper, we present a higher-order graph convolutional network (HOGCN) to aggregate information from the higher-order neighborhood for biomedical interaction prediction. Specifically, HOGCN collects feature representations of neighbors at various distances and learns their linear mixing to obtain informative representations of biomedical entities. Experiments on four interaction networks, including protein-protein, drug-drug, drug-target, and gene-disease interactions, show that HOGCN achieves more accurate and calibrated predictions. HOGCN performs well on noisy, sparse interaction networks when feature representations of neig...
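Illustrative sketch (not the authors' implementation): the HOGCN summary above describes collecting neighbor features at various distances and learning a linear mixing of them. The snippet below shows that idea in its simplest form, assuming a row-normalized adjacency matrix and fixed scalar mixing weights in place of the learned weight matrices a real model would use. All function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(adj):
    """Divide each row by its sum so every node averages over its neighbors."""
    normalized = []
    for row in adj:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def mix_neighborhoods(adj, features, weights):
    """Return sum_k weights[k] * (A_hat^k @ features) for k = 0..len(weights)-1.

    weights[0] scales a node's own features, weights[1] its 1-hop
    neighborhood, weights[2] its 2-hop neighborhood, and so on.
    """
    a_hat = row_normalize(adj)
    propagated = [row[:] for row in features]  # A_hat^0 @ X = X
    mixed = [[0.0] * len(features[0]) for _ in features]
    for w in weights:
        for i, row in enumerate(propagated):
            for j, value in enumerate(row):
                mixed[i][j] += w * value
        propagated = matmul(a_hat, propagated)  # move one hop further out
    return mixed


# Path graph 0 - 1 - 2; only node 0 carries a non-zero feature.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
two_hop_only = mix_neighborhoods(adj, features, [0.0, 0.0, 1.0])
```

With all weight placed on the 2-hop term, node 2 receives signal from node 0 even though the two are not directly connected, which is the kind of information a purely 1-hop aggregation would miss.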
Generating 3D Molecular Structures Conditional on a Receptor Binding Site with Deep Generative Models.
EN: Deep generative models have been applied with increasing success to the generation of two-dimensional molecules as SMILES strings and molecular graphs. In this work we describe for the first time a deep generative model that can generate 3D molecular structures conditioned on a three-dimensional (3D) binding pocket. Using convolutional neural networks, we encode atomic density grids into separate receptor and ligand latent spaces. The ligand latent space is variational to support sampling of new molecules. A decoder network generates atomic densities of novel ligands conditioned on the receptor. Discrete atoms are then fit to these continuous densities to create molecular structures. We show that valid and unique molecules can be readily sampled from the variational latent space defined by a reference 'seed' structure and that generated structures have reasonable interactions with the binding site. As structures are sampled farther in latent space from the seed structure, the novelty of the generated structures increases, but the predicted binding affinity decreases. Overall, we demonstrate the feasibility of conditional 3D molecular structure generation and provide a starting point for...
Advances to tackle backbone flexibility in protein docking.
EN: Computational docking methods can provide structural models of protein-protein complexes, but protein backbone flexibility upon association often thwarts accurate predictions. In recent blind challenges, medium or high accuracy models were submitted in less than 20% of the "difficult" targets (with significant backbone change or uncertainty). Here, we describe recent developments in protein-protein docking and highlight advances that tackle backbone flexibility. In molecular dynamics and Monte Carlo approaches, enhanced sampling techniques have reduced time-scale limitations. Internal coordinate formulations can now capture realistic motions of monomers and complexes using harmonic dynamics. And machine learning approaches adaptively guide docking trajectories or generate novel binding site predictions from deep neural networks trained on protein interfaces. These tools poise the field to break through the longstanding challenge of correctly predicting complex structures with significant conformational change.
Prediction and mitigation of mutation threats to COVID-19 vaccines and antibody therapies.
EN: Antibody therapeutics and vaccines are among our last resort to end the raging COVID-19 pandemic. They, however, are prone to over 5,000 mutations on the spike (S) protein uncovered by a Mutation Tracker based on over 200,000 genome isolates. It is imperative to understand how mutations would impact vaccines and antibodies in development. In this work, we study the mechanism, frequency, and ratio of mutations on the S protein. Additionally, we use 56 antibody structures and analyze their 2D and 3D characteristics. Moreover, we predict the mutation-induced binding free energy (BFE) changes for the complexes of S protein and antibodies or ACE2. By integrating genetics, biophysics, deep learning, and algebraic topology, we reveal that most of the 462 mutations on the receptor-binding domain (RBD) will weaken the binding of S protein and antibodies and disrupt the efficacy and reliability of antibody therapies and vaccines. A list of 31 vaccine escape mutants is identified, while many other disruptive mutations are detailed as well. We also unveil that about 65% of existing RBD mutations, including those variants recently found in the United Kingdom (UK) and South Africa, are binding-str...
Quenching to fix metastable states in models of prebiotic chemistry.
EN: For prebiotic chemistry to succeed in producing a starting metastable, autocatalytic and reproducing system subject to evolutionary selection, it must satisfy at least two apparently contradictory requirements. Because such systems are rare, a search among vast numbers of molecular combinations must take place naturally, requiring rapid rearrangement and breaking of covalent bonds. But once a relevant system is found, such rapid disruption and rearrangement would be very likely to destroy the system before much evolution could take place. In this paper we explore the possibility, using a model developed previously, that the search process could occur under different environmental conditions than the subsequent fixation and growth of a lifelike chemical system. We use the example of a rapid change in temperature to illustrate the effect and refer to the rapid change as a 'quench', borrowing terminology from the study of the physics and chemistry of glass formation. The model study shows that interrupting a high temperature nonequilibrium state with a rapid quench to lower temperatures can substantially increase the probability of producing a chemical state with lifelike characteristics of...
Addressing the Real-world Class Imbalance Problem in Dermatology.
EN: Class imbalance is a common problem in medical diagnosis, causing a standard classifier to be biased towards the common classes and perform poorly on the rare classes. This is especially true for dermatology, a specialty with thousands of skin conditions, many of which have low prevalence in the real world. Motivated by recent advances, we explore few-shot learning methods as well as conventional class imbalance techniques for the skin condition recognition problem and propose an evaluation setup to fairly assess the real-world utility of such approaches. We find that the performance of few-shot learning methods does not reach that of conventional class imbalance techniques, but combining the two approaches using a novel ensemble improves model performance, especially for rare classes. We conclude that ensembling can be useful to address the class imbalance problem, yet progress can further be accelerated by real-world evaluation setups for benchmarking new methods.
Categorizing Online Shopping Behavior from Cosmetics to Electronics: An Analytical Framework.
EN: A success factor for modern companies in the age of Digital Marketing is to understand how customers think and behave based on their online shopping patterns. While the conventional method of gathering consumer insights through questionnaires and surveys still forms the basis of descriptive analytics for market intelligence units, we propose a machine learning framework to automate this process. In this paper we present a modular consumer data analysis platform that processes session-level interaction records between users and products to predict session-level, user-journey-level and customer-behavior-specific patterns leading towards purchase events. We explore the computational framework and provide test results on two big data sets, cosmetics and consumer electronics, of size 2 GB and 15 GB, respectively. The proposed system achieves 97-99% classification accuracy and recall for user-journey-level purchase predictions and categorizes buying behavior into 5 clusters with increasing purchase ratios for both data sets. Thus, the proposed framework is extendable to other large e-commerce data sets to obtain automated purchase predictions and descriptive consumer insights.
Convective heat transfer of a model emulsion at the droplet scale.
EN: We numerically study the Rayleigh-Bénard (RB) convection in two-dimensional model emulsions confined between two parallel walls at fixed temperatures. The systems under study are heterogeneous, with finite-size droplets dispersed in a continuous phase. The droplet concentration is chosen to explore the convective heat transfer of both Newtonian (low droplet concentration) and non-Newtonian (high droplet concentration) emulsions, the latter exhibiting shear-thinning rheology, with a noticeable increase of viscosity at low shear rates. It is well known that the transition to convection of a homogeneous Newtonian system is accompanied by the onset of steady flow and time-independent heat flux; in marked contrast, the heterogeneity of emulsions brings in an additional and previously unexplored phenomenology. As a matter of fact, when the droplet concentration increases, we observe that the heat transfer process is mediated by a non-steady flow, with neat heat-flux fluctuations, obeying a non-Gaussian statistics. The observed findings are ascribed to the emergence of space correlations among distant droplets, which we highlight via direct measurements of the droplets displacement and th...
Effective modelling of the Rayleigh-Bénard convection of concentrated emulsions with finite-size droplets.
EN: We present mesoscale numerical simulations of Rayleigh-Bénard convection in a two-dimensional concentrated emulsion, confined between two parallel walls, heated from below and cooled from above, under the effect of buoyancy forces. The systems under study comprise finite-size droplets, whose concentration $Φ_0$ is varied, ranging from the dilute limit up to the point where the emulsion starts to be packed and exhibits non-Newtonian rheology. We focus on the characterisation of the convective heat transfer properties close to the transition from conductive to convective states. The convective flow is confined and heterogeneous, which causes the emulsion to exhibit concentration heterogeneities in space $φ_0(y)$, depending on the location in the wall-to-wall direction ($y$). With the aim of assessing quantitatively the heat transfer efficiency of such heterogeneous systems, we resort to a side-by-side comparison between the concentrated emulsion system and a single-phase (SP) system, whose local viscosity $η^{\mbox{SP}}(y)$ is suitably constructed from the shear rheology of the emulsion. Such comparison highlights that a suitable degree $Λ$ of coarse-graining needs to be introduced i...
Temporal Positive-unlabeled Learning for Biomedical Hypothesis Generation via Risk Estimation.
EN: Understanding the relationships between biomedical terms like viruses, drugs, and symptoms is essential in the fight against diseases. Many attempts have been made to introduce the use of machine learning to the scientific process of hypothesis generation (HG), which refers to the discovery of meaningful implicit connections between biomedical terms. However, most existing methods fail to truly capture the temporal dynamics of scientific term relations and also assume unobserved connections to be irrelevant (i.e., in a positive-negative (PN) learning setting). To break these limits, we formulate this HG problem as a future connectivity prediction task on a dynamic attributed graph via positive-unlabeled (PU) learning. Then, the key is to capture the temporal evolution of node pair (term pair) relations from just the positive and unlabeled data. We propose a variational inference model to estimate the positive prior, and incorporate it in the learning of node pair embeddings, which are then used for link prediction. Experimental results on real-world biomedical term relationship datasets and case study analyses on a COVID-19 dataset validate the effectiveness of the proposed model.
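Illustrative sketch (not the paper's model): the summary above frames hypothesis generation as positive-unlabeled (PU) learning with an estimated positive prior. The snippet below shows the widely used non-negative PU risk estimator as a stand-in, assuming a sigmoid surrogate loss and a class prior that is simply passed in, whereas the paper estimates the prior with variational inference. Function and parameter names are hypothetical.

```python
import math


def sigmoid_loss(score, label):
    """Surrogate loss 1 / (1 + exp(label * score)), with label in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk from scores on positive and unlabeled pairs.

    prior is the assumed fraction of positives among the unlabeled pairs.
    """
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Unlabeled data mixes positives and negatives, so the positives'
    # share is subtracted to estimate the risk on true negatives.
    negative_part = risk_unl_as_neg - prior * risk_pos_as_neg
    # Clipping at zero keeps the estimate from going negative.
    return prior * risk_pos_as_pos + max(0.0, negative_part)


# An uninformative scorer (all scores zero) with an assumed prior of 0.5.
baseline_risk = nn_pu_risk([0.0, 0.0], [0.0, 0.0], prior=0.5)
```

The key point for the PU setting is that unobserved term pairs are never treated as confirmed negatives; they only enter through the unlabeled term, corrected by the prior.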
AI Progress in Skin Lesion Analysis.
EN: We examine progress in the use of AI for detecting skin lesions, with particular emphasis on the erythema migrans rash of acute Lyme disease, and other lesions, such as those from conditions like herpes zoster (shingles), tinea corporis, erythema multiforme, cellulitis, insect bites, or tick bites. We discuss important challenges for these applications, in particular the problems of AI bias regarding the lack of skin images of dark-skinned individuals, being able to accurately detect, delineate, and segment lesions or regions of interest compared to normal skin in images, and low-shot learning (addressing classification with a paucity of training images). Solving these problems ranges from being highly desirable -- e.g. for delineation, which may be useful to disambiguate between similar types of lesions, and perform improved diagnostics -- to required, as is the case for AI de-biasing, to allow for the deployment of fair AI techniques in the clinic for skin lesion analysis. For the problem of low-shot learning in particular, we report skin analysis algorithms that gracefully degrade and still perform well at low shots, when compared to baseline algorithms: when using ...
Heterogeneous Molecular Graph Neural Networks for Predicting Molecule Properties.
EN: As they carry great potential for modeling complex interactions, graph neural network (GNN)-based methods have been widely used to predict quantum mechanical properties of molecules. Most of the existing methods treat molecules as molecular graphs in which atoms are modeled as nodes. They characterize each atom's chemical environment by modeling its pairwise interactions with other atoms in the molecule. Although these methods achieve great success, few works explicitly take many-body interactions, i.e., interactions between three or more atoms, into consideration. In this paper, we introduce a novel graph representation of molecules, the heterogeneous molecular graph (HMG), in which nodes and edges are of various types, to model many-body interactions. HMGs have the potential to carry complex geometric information. To leverage the rich information stored in HMGs for chemical prediction problems, we build heterogeneous molecular graph neural networks (HMGNN) on the basis of a neural message passing scheme. HMGNN incorporates global molecule representations and an attention mechanism into the prediction process. The predictions of HMGNN are invariant to translation and r...
Encapsulation of fragrances and oils by core-shell structures from silica nanoparticles, surfactant and polymer: Effect of particle size.
EN: Oils and fragrances can be encapsulated by using composite shells of silica nanoparticles, polymer and surfactant (potassium oleate). The template for the creation of the core-shell structure is a particle stabilized (Pickering) emulsion. The surfactant adsorbs on the nanoparticles and leads to their reversible hydrophobization and adsorption on the oil-water interface. The outer layer of the self-assembled shell represents a layer from crosslinked polymer. The procedure of encapsulation is simple and includes single homogenization by ultrasound of the formulation that contains all ingredients together. The produced capsules have mean radius in the range between 2 and 11 microns. By order of magnitude and trend, the capsule size follows the law of limited coalescence with respect to the dependence on nanoparticle size and concentration. The composite structure of the shells leads also to dependence on the concentrations of added polymer and surfactant. The produced microcapsules are stable when rinsed with pure water of pH in the range 3 - 10. However, if dispersed in water of pH > 11, the microcapsules are destabilized and release their cargo, i.e., they are pH-responsive. Various...
Faceting and flattening of emulsion droplets: a mechanical model.
EN: When cooled down, emulsion droplets stabilized by a frozen interface of alkane molecules and surfactants have been observed to undergo a spectacular sequence of morphological transformations: from spheres to faceted icosahedra, down to flattened liquid platelets. While generally ascribed to the interplay between the elasticity of the frozen interface and surface tension, the physical mechanisms underpinning these transitions have remained elusive, despite different theoretical pictures having been proposed in recent years. In this article, we introduce a comprehensive mechanical model of morphing emulsion droplets, which quantitatively accounts for various experimental observations, including the scaling behavior of the faceting transition. Our analysis highlights the role of gravity and the spontaneous curvature of the frozen interface in determining the specific transition pathway.
Graph-convolution neural network-based flexible docking utilizing coarse-grained distance matrix.
EN: Prediction of protein-ligand complexes for flexible proteins remains a challenging problem in computational structural biology and drug design. Here we present two novel deep neural network approaches with significant improvement in efficiency and accuracy of binding mode prediction on a large and diverse set of protein systems compared to standard docking. Whereas the first, a graph convolutional network, is used for re-ranking poses, the second approach aims to generate and rank poses independently of standard docking approaches. This novel approach relies on the prediction of distance matrices between ligand atoms and protein C-alpha atoms, thus incorporating side-chain flexibility implicitly.
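Illustrative sketch (not from the paper): the summary above relies on distance matrices between ligand atoms and protein C-alpha atoms. The snippet below shows how such a matrix is computed from known coordinates, which is the quantity a model of this kind would be trained to predict. The function name and the toy coordinates are hypothetical.

```python
import math


def ligand_calpha_distances(ligand_xyz, calpha_xyz):
    """Return a matrix D where D[i][j] is the Euclidean distance between
    ligand atom i and protein C-alpha atom j (same units as the input)."""
    return [[math.dist(atom, ca) for ca in calpha_xyz] for atom in ligand_xyz]


# Toy coordinates in angstroms: one ligand atom, two C-alpha atoms.
ligand = [(0.0, 0.0, 0.0)]
calphas = [(3.0, 4.0, 0.0), (0.0, 0.0, 2.0)]
distances = ligand_calpha_distances(ligand, calphas)
```

Because only inter-atomic distances are stored, the matrix is unchanged when the whole complex is rotated or translated. Using one C-alpha position per residue also leaves side-chain placement unspecified, which is consistent with the summary's point that side-chain flexibility is handled implicitly.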
Properties Of Winning Tickets On Skin Lesion Classification.
EN: Skin cancer affects a large population every year, so automated skin cancer detection algorithms can greatly help clinicians. Prior efforts involving deep learning models have high detection accuracy. However, most of the models have a large number of parameters, with some works even using an ensemble of models to achieve good accuracy. In this paper, we investigate a recently proposed pruning technique based on the Lottery Ticket Hypothesis. We find that iterative pruning of the network resulted in improved accuracy compared to that of the unpruned network, implying that the lottery ticket hypothesis can be applied to the problem of skin cancer detection and can result in a smaller network for inference. We also examine the accuracy across sub-groups created by gender and age, and find that some sub-groups show a larger increase in accuracy than others.
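Illustrative sketch (not the authors' code): the summary above refers to iterative pruning under the Lottery Ticket Hypothesis. The snippet below outlines the usual loop (train, prune the smallest-magnitude surviving weights, rewind the survivors to their initial values, repeat) on a flat list of weights. The `train` callback is a hypothetical stand-in for real network training, and the per-round pruning fraction is an assumption.

```python
def prune_smallest(weights, mask, fraction):
    """Return a new mask with `fraction` of the active weights removed,
    choosing those with the smallest absolute value."""
    new_mask = list(mask)
    active = [i for i, keep in enumerate(new_mask) if keep]
    n_prune = int(len(active) * fraction)
    for i in sorted(active, key=lambda idx: abs(weights[idx]))[:n_prune]:
        new_mask[i] = 0
    return new_mask


def iterative_pruning(initial_weights, train, rounds, fraction):
    """Run `rounds` of train-then-prune and return the final binary mask.

    train(weights, mask) must return trained weights of the same length.
    """
    mask = [1] * len(initial_weights)
    for _ in range(rounds):
        # Rewind: surviving weights restart from their original initialization.
        weights = [w * keep for w, keep in zip(initial_weights, mask)]
        trained = train(weights, mask)
        mask = prune_smallest(trained, mask, fraction)
    return mask


# Stand-in "training" that leaves weights unchanged, for demonstration only.
def identity_train(weights, mask):
    return weights


final_mask = iterative_pruning([0.1, -0.5, 0.3, 0.9], identity_train, rounds=2, fraction=0.5)
```

The surviving mask applied to the initial weights is the candidate "winning ticket"; in practice each round would train the masked network on the skin lesion data before deciding what to prune.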
Conceptualized Representation Learning for Chinese Biomedical Text Mining.
EN: Biomedical text mining is becoming increasingly important as the number of biomedical documents and web data rapidly grows. Recently, word representation models such as BERT have gained popularity among researchers. However, it is difficult to estimate their performance on datasets containing biomedical texts, as the word distributions of general and biomedical corpora are quite different. Moreover, the medical domain has long-tail concepts and terminologies that are difficult to learn via language models. For Chinese biomedical text, this is even more difficult due to its complex structure and the variety of phrase combinations. In this paper, we investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora and propose a novel conceptualized representation learning approach. We also release a new Chinese Biomedical Language Understanding Evaluation benchmark (ChineseBLUE). We examine the effectiveness of Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach. Experimental results on the benchmark show that our approach could bring significant gain. We release the pre-trained model on GitHub: https://gi...
Generative chemistry: drug discovery with deep learning generative models.
EN: The de novo design of molecular structures using deep learning generative models offers an encouraging solution to drug discovery in the face of the continuously increasing cost of new drug development. From the generation of original texts, images, and videos to the design of novel molecular structures from scratch, the creativity of deep learning generative models has surprised us with the heights machine intelligence can achieve. The purpose of this paper is to review the latest advances in generative chemistry, which relies on generative modeling to expedite the drug discovery process. This review starts with a brief history of artificial intelligence in drug discovery to outline this emerging paradigm. Commonly used chemical databases, molecular representations, and tools in cheminformatics and machine learning are covered as the infrastructure for generative chemistry. The review then focuses on detailed discussions of cutting-edge generative architectures for compound generation, including recurrent neural networks, variational autoencoders, adversarial autoencoders, and generative adversarial networks. Challenges and future perspectives follow.
Single bubble and drop techniques for characterizing foams and emulsions.
EN: The physics of foams and emulsions has traditionally been studied using bulk foam/emulsion tests and single film platforms such as the Scheludko cell. Recently there has been a renewed interest in a third class of techniques that we term single bubble/drop tests, which employ isolated whole bubbles and drops to probe the characteristics of foams and emulsions. Single bubble and drop techniques provide a convenient framework for investigating a number of important characteristics of foams and emulsions, including the rheology, stabilization mechanisms, and rupture dynamics. In this review we provide a comprehensive discussion of the various single bubble/drop platforms and the associated experimental measurement protocols, including the construction of coalescence time distributions, visualization of the thin film profiles, and characterization of the interfacial rheological properties. Subsequently, we summarize the recent developments in foam and emulsion science with a focus on the results obtained through single bubble/drop techniques. We conclude the review by presenting important avenues for future research.
Destabilization and phase separation of particle suspensions in emulsions.
EN: Yield stress fluids are widely used in industrial applications to arrest dense solid particles, which can be studied by using a concentrated emulsion as a model fluid. We show in experiments that particle sedimentation in emulsions cannot be predicted by the classical criterion for spheres embedded in a yield stress fluid. Phase separation processes take place, where a liquid layer forms and particle sedimentation is enhanced by the emulsion drainage. In addition, emulsion drainage can be arrested or enhanced by the amount of particles embedded in the emulsion. A minimal mathematical model is developed and solved in numerical simulations to describe the emulsion drainage in the presence of particles, which compares favorably with the experimental stability diagram and the sedimentation dynamics.
A Multilingual Neural Machine Translation Model for Biomedical Data.
EN: We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain. The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English. It is trained with large amounts of generic and biomedical data, using domain tags. Our benchmarks show that it performs near state-of-the-art both on news (generic domain) and biomedical test sets, and that it outperforms the existing publicly released models. We believe that this release will help the large-scale multilingual analysis of the digital content of the COVID-19 crisis and of its effects on society, economy, and healthcare policies. We also release a test set of biomedical text for Korean-English. It consists of 758 sentences from official guidelines and recent papers, all about COVID-19.
Machine Learning in Nano-Scale Biomedical Engineering.
EN: Machine learning (ML) empowers biomedical systems with the capability to optimize their performance by modeling the available data extremely well, without using strong assumptions about the modeled system. Especially in nano-scale biosystems, where the generated data sets are too vast and complex to mentally parse without computational assistance, ML is instrumental in analyzing and extracting new insights, accelerating material and structure discoveries, and designing experience as well as supporting nano-scale communications and networks. However, despite these efforts, the use of ML in nano-scale biomedical engineering remains under-explored in certain areas, and research challenges are still open in fields such as structure and material design and simulations, communications and signal processing, and bio-medicine applications. In this article, we review the existing research regarding the use of ML in nano-scale biomedical engineering. In more detail, we first identify and discuss the main challenges that can be formulated as ML problems. These challenges are classified into the three aforementioned main categories. Next, we discuss the state-of-the-art ML methodologi...
Detection and Annotation of Plant Organs from Digitized Herbarium Scans using Deep Learning.
EN: As herbarium specimens are increasingly becoming digitized and accessible in online repositories, advanced computer vision techniques are being used to extract information from them. The presence of certain plant organs on herbarium sheets is useful information in various scientific contexts, and automatic recognition of these organs will help mobilize such information. In our study we use deep learning to detect plant organs on digitized herbarium specimens with Faster R-CNN. For our experiment we manually annotated hundreds of herbarium scans with thousands of bounding boxes for six types of plant organs and used them for training and evaluating the plant organ detection model. The model worked particularly well on leaves and stems; flowers were also present in large numbers on the sheets but were not recognized equally well.
Genome Sequence Classification for Animal Diagnostics with Graph Representations and Deep Neural Networks.
EN: Bovine Respiratory Disease Complex (BRDC) is a complex respiratory disease in cattle with multiple etiologies, including bacterial and viral. It is estimated that mortality, morbidity, therapy, and quarantine resulting from BRDC account for significant losses in the cattle industry. Early detection and management of BRDC are crucial in mitigating economic losses. Current animal disease diagnostics is based on traditional tests such as bacterial culture, serology, and Polymerase Chain Reaction (PCR) tests. Even though these tests are validated for several diseases, their main challenge is their limited ability to detect the presence of multiple pathogens simultaneously. Advances in data analytics and machine learning, and their application to metagenome sequencing, are setting trends in several applications. In this work, we demonstrate a machine learning approach to identify pathogen signatures present in bovine metagenome sequences using k-mer-based network embedding followed by a deep learning-based classification task. With experiments conducted on two different simulated datasets, we show that network-based machine learning approaches can detect pathogen signatures with up to 89....
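As an illustration of what a k-mer-based graph representation can look like, the sketch below splits a sequence into overlapping k-mers and counts directed edges between consecutive k-mers. The graph construction (a de Bruijn-style adjacency), the function names, and the toy sequence are assumptions; the abstract does not specify how the authors build or embed their networks.

```python
from collections import Counter

def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_graph(seq, k):
    """Count directed edges between consecutive k-mers in the sequence.
    Keys are (source_kmer, target_kmer); values are edge weights."""
    nodes = kmers(seq, k)
    return Counter(zip(nodes, nodes[1:]))

# Toy read (hypothetical, not from a real metagenome).
edges = kmer_graph("ATGATGA", 3)
for (src, dst), weight in sorted(edges.items()):
    print(src, "->", dst, weight)
```

A weighted edge list of this kind is the sort of input a node- or graph-embedding method would consume before the downstream classifier.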
Rheology of protein-stabilised emulsion gels envisioned as composite networks. 2 -- Framework for the study of emulsion gels.
EN: The aggregation of protein-stabilised emulsions leads to the formation of emulsion gels. These soft solids are classically envisioned as droplet-filled matrices. Here however, it is assumed that protein-coated sub-micron droplets contribute to the network formation in a similar way to proteins. Emulsion gels are thus envisioned as composite networks made of proteins and droplets. Emulsion gels with a wide range of composition are prepared and their viscoelasticity and frequency dependence are measured. Their rheological behaviours are then analysed and compared with the properties of pure gels presented in the first part of this study. The rheological behaviour of emulsion gels is shown to depend mostly on the total volume fraction, while the composition of the gel indicates its level of similarity with either pure droplet gels or pure protein gels. These results converge to form an emerging picture of protein-stabilised emulsion gel as intermediate between droplet and protein gels. This justifies a posteriori the hypothesis of composite networks, and opens the road for the formulation of emulsion gels with fine-tuned rheology.
Nonequilibrium continuous phase transition in colloidal gelation with short-range attraction.
EN: The dynamical arrest of attractive colloidal particles into out-of-equilibrium structures, known as gelation, is central to biophysics, materials science, nanotechnology, and food and cosmetic applications, but a complete understanding is lacking. In particular, for intermediate particle density and attraction, the structure formation process remains unclear. Here, we show that the gelation of short-range attractive particles is governed by a nonequilibrium percolation process. We combine experiments on critical Casimir colloidal suspensions, numerical simulations, and analytical modeling with a master kinetic equation to show that cluster sizes and correlation lengths diverge with exponents 1.6 and 0.8, respectively, consistent with percolation theory, while detailed balance in the particle attachment and detachment processes is broken. Cluster masses exhibit power-law distributions with exponents -3/2 and -5/2 before and after percolation, as predicted by solutions to the master kinetic equation. These results revealing a nonequilibrium continuous phase transition unify the structural arrest and yielding into related frameworks.
The natural polyphenol fortunellin and its structural analogs are inhibitors of the SARS-CoV-2 main proteinase dimerization, as revealed by molecular simulation studies.
EN: 3CL-Pro (or M-Pro), the SARS-CoV-2 main protease, acts as a homodimer and is responsible for the cleavage of the large polyprotein 1ab transcript into proteins acting on viral growth and replication. 3CL-Pro has been one of the most studied SARS-CoV-2 proteins and the subject of therapeutic interventions targeting its catalytic domain. A number of drug candidates have been reported, including some natural products. Here, we investigated in silico, through binding and molecular dynamics simulations, the natural product space for the identification of candidate 3CL-Pro dimerization inhibitors. We report that fortunellin (acacetin 7-O-neohesperidoside), a natural flavonoid O-glycoside, is a potent inhibitor of 3CL-Pro dimerization. A search of the ZINC natural products database identified another 16 related molecules, including apiin and rhoifolin, with interesting pharmacological properties. We propose that fortunellin and its structural analogs might be the basis of novel pharmaceuticals and dietary supplements against SARS-CoV-2 induced COVID-19 disease.
Clinical connectivity map for drug repurposing: using laboratory tests to bridge drugs and diseases.
EN: Drug repurposing has attracted increasing attention from both the pharmaceutical industry and the research community. Many existing computational drug repurposing methods rely on preclinical data (e.g., chemical structures, drug targets), resulting in translational problems for clinical trials. In this study, we propose a clinical connectivity map framework for drug repurposing by leveraging laboratory tests to analyze complementarity between drugs and diseases. We establish clinical drug effect vectors (i.e., drug-laboratory test associations) by applying a continuous self-controlled case series model on longitudinal electronic health record data. We establish clinical disease sign vectors (i.e., disease-laboratory test associations) by applying a Wilcoxon rank sum test on large-scale national survey data. Finally, we compute a repurposing possibility score for each drug-disease pair by applying a dot product-based scoring function on clinical disease sign vectors and clinical drug effect vectors. We comprehensively evaluate 392 drugs for 6 important chronic diseases (e.g., asthma, coronary heart disease, type 2 diabetes, etc.). We discover not only known associations between ...
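The dot product-based scoring step can be sketched as below. The abstract does not give the exact formula, so the sign convention here (negating the dot product so that a drug which moves laboratory tests in the opposite direction to the disease scores high) is an assumption, as are the function name, the dictionary layout keyed by laboratory test, and the toy values.

```python
def repurposing_score(disease_signs, drug_effects):
    """Negated dot product over the laboratory tests present in both vectors.
    Positive scores mean the drug tends to shift labs against the disease."""
    shared = disease_signs.keys() & drug_effects.keys()
    return -sum(disease_signs[t] * drug_effects[t] for t in shared)

# Hypothetical vectors: +1 means the lab value is raised, -1 means lowered.
disease = {"glucose": 1.0, "hdl": -1.0}
drug_a = {"glucose": -0.5, "hdl": 0.25}   # counteracts both disease signs
drug_b = {"glucose": 0.5, "ldl": 0.3}     # pushes glucose the same way

print(repurposing_score(disease, drug_a))
print(repurposing_score(disease, drug_b))
```

Restricting the sum to shared laboratory tests is a design choice for the sketch: tests missing from either vector simply contribute nothing to the score.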
BERT Learns (and Teaches) Chemistry.
EN: Modern computational organic chemistry is becoming increasingly data-driven. There remain a large number of important unsolved problems in this area such as product prediction given reactants, drug discovery, and metric-optimized molecule synthesis, but efforts to solve these problems using machine learning have also increased in recent years. In this work, we propose the use of attention to study functional groups and other property-impacting molecular substructures from a data-driven perspective, using a transformer-based model (BERT) on datasets of string representations of molecules and analyzing the behavior of its attention heads. We then apply the representations of functional groups and atoms learned by the model to tackle problems of toxicity, solubility, drug-likeness, and synthesis accessibility on smaller datasets using the learned representations as features for graph convolution and attention models on the graph structure of molecules, as well as fine-tuning of BERT. Finally, we propose the use of attention visualization as a helpful tool for chemistry practitioners and students to quickly identify important substructures in various chemical properties.
Structure and dynamics of DOPC vesicles: A transformation from unilamellar to multilamellar vesicles by n-alkyl-PEO polymer.
EN: We investigate the influence of a non-ionic surfactant-like polymer on phospholipid vesicles. Our results from cryogenic transmission electron microscopy (cryo-TEM), dynamic light scattering (DLS), and small-angle neutron and X-ray scattering (SANS/SAXS) identify the existence of multilayer vesicles and an increase in vesicle size in the presence of the polymer. We present a generalized model to obtain the bending rigidity from neutron spin echo spectroscopy (NSE) data for multilayer vesicles. We demonstrate that polymers are trapped in the lipid bilayer, causing a partial disruption of the vesicle, which is attributed to the reduction in bending rigidity per unit bilayer. We also observe substantial dampening of the trapped lipid tail motion in the presence of the polymer. Our results highlight the possibility of using specialized polymers that can disrupt membranes and control their dynamics, with possible application in topical drug or nutraceutical formulations.
Interpreting Holographic Molecular Binding Assays with Effective Medium Theory.
EN: Holographic molecular binding assays use holographic video microscopy to directly detect molecules binding to the surfaces of micrometer-scale colloidal beads by monitoring associated changes in the beads' light-scattering properties. Holograms of individual spheres are analyzed by fitting to a generative model based on the Lorenz-Mie theory of light scattering. Each fit yields an estimate of a probe bead's diameter and refractive index with sufficient precision to watch the beads grow as molecules bind. Rather than modeling the molecular-scale coating, however, these fits use effective medium theory, treating the coated sphere as if it were homogeneous. This effective-sphere analysis is rapid and numerically robust and so is useful for practical implementations of label-free immunoassays. Here, we assess how effective-sphere properties reflect the properties of molecular-scale coatings by modeling coated spheres with the discrete-dipole approximation and analyzing their holograms with the effective-sphere model.
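To illustrate the effective-sphere idea, the sketch below mixes the refractive indices of a bead core and its coating into a single effective index using Maxwell Garnett (Lorentz-Lorenz) volume-weighted mixing. This is one common effective medium formula and is assumed here for illustration; the abstract does not state which variant the authors use, and the refractive indices, radius, and coating thickness are hypothetical.

```python
import math

def lorentz_lorenz(m):
    """Lorentz-Lorenz factor for a relative refractive index m."""
    return (m * m - 1.0) / (m * m + 2.0)

def effective_index(components, n_medium):
    """Effective refractive index of a composite sphere.
    components: list of (volume_fraction, refractive_index) pairs; any
    remaining volume is assumed to be filled by the surrounding medium."""
    total = sum(phi * lorentz_lorenz(n / n_medium) for phi, n in components)
    m_eff = math.sqrt((1.0 + 2.0 * total) / (1.0 - total))
    return m_eff * n_medium

# Hypothetical bead: 0.5 um core radius with a 10 nm molecular coating.
core_radius, coating = 0.5, 0.01
phi_core = (core_radius / (core_radius + coating)) ** 3
n_eff = effective_index([(phi_core, 1.59), (1.0 - phi_core, 1.45)], n_medium=1.34)
print(n_eff)
```

As the coating grows, the effective radius increases while the effective index drifts from the core value toward the coating value, which is the kind of joint change a holographic fit would track.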
Research and development of MolAICal for drug design via deep learning and classical programming.
EN: Deep learning methods have permeated the research area of computer-aided drug design. A deep learning generative model and classical algorithms can be used together for three-dimensional (3D) drug design in the 3D pocket of the receptor. Here, three aspects of MolAICal are illustrated for drug design: in the first part, MolAICal uses a genetic algorithm, the Vinardo score, and a deep learning generative model trained by a generative adversarial net (GAN) for drug design. In the second part, the deep learning generative model is trained on drug-like molecules from drug databases such as the ZINC database. MolAICal invokes the deep learning generative model and molecular docking for drug virtual screening automatically. In the third part, useful drug tools are added for calculating related properties such as Pan-assay interference compounds (PAINS), Lipinski's rule of five, synthetic accessibility (SA), and so on. In addition, structural similarity search and quantitative structure-activity relationship (QSAR) tools are also embedded for the calculation of drug properties in MolAICal. MolAICal will constantly optimize and develop the current and new modules f...
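Of the property filters listed, Lipinski's rule of five is simple enough to sketch directly: molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. The sketch below takes precomputed descriptors as plain numbers; it is not MolAICal's implementation, the function names are hypothetical, and allowing one violation is a common convention assumed here.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        mol_weight > 500,   # molecular weight in daltons
        logp > 5,           # octanol-water partition coefficient
        h_donors > 5,       # hydrogen-bond donors
        h_acceptors > 10,   # hydrogen-bond acceptors
    ]
    return sum(checks)

def passes_rule_of_five(mol_weight, logp, h_donors, h_acceptors, max_violations=1):
    """Assumed convention: a molecule passes with at most one violation."""
    return lipinski_violations(mol_weight, logp, h_donors, h_acceptors) <= max_violations

# Approximate descriptors for a small aspirin-like molecule, and a
# hypothetical large, lipophilic molecule for contrast.
print(passes_rule_of_five(180.2, 1.2, 1, 4))
print(passes_rule_of_five(720.0, 6.1, 6, 12))
```

In practice the descriptors would come from a cheminformatics toolkit; keeping them as inputs lets the rule itself be read and tested in isolation.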
Mucin-inspired, high molecular weight virus binding inhibitors show biphasic binding behavior to influenza A viruses.
EN: Multivalent virus binding inhibitors are a promising new class of antivirals, preventing virus infection of cells by inhibiting the first step in the viral infection cycle - binding of viruses to the cell surface. The design of multivalent virus binding inhibitors is complex as many properties, such as inhibitor size and functionalization with virus attachment factors, have a strong impact on the inhibition efficiency. In this study, we synthesized virus binding inhibitors, the design of which has been inspired by mucins, which are naturally occurring glycosylated proteins with molecular weights in the MDa range and which show high affinity in the interaction with various viruses. Hyperbranched polyglycerols (hPG), serving as polymeric scaffolds, were functionalized with sialic acids and sulfate groups at degrees of functionalization as suggested from the structure of mucins. The molecular weights of the hPG-based inhibitors ranged between 10 and 2600 kDa, thereby hitting the size of mucins (MDa scale) and allowing for comparing the inhibition efficiency of the largest, mucin-sized inhibitor (2600 kDa) with related inhibitors of lower molecular weight. Inhibition efficiencies were ...
HMIC: Hierarchical Medical Image Classification, A Deep Learning Approach.
EN: Image classification is central to the big data revolution in medicine. Improved information processing methods for diagnosis and classification of digital medical images have been shown to be successful via deep learning approaches. As this field is explored, there are limitations to the performance of traditional supervised classifiers. This paper outlines an approach that differs from current medical image classification methods, which treat the task as multi-class classification. We performed a hierarchical classification using our Hierarchical Medical Image Classification (HMIC) approach. HMIC uses stacks of deep learning models to give particular comprehension at each level of the clinical picture hierarchy. To test performance, we use small bowel biopsy images that contain three categories at the parent level (Celiac Disease, Environmental Enteropathy, and histologically normal controls). At the child level, Celiac Disease severity is classified into 4 classes (I, IIIa, IIIb, and IIIc).
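The parent-then-child routing described above can be sketched as follows: a parent model picks the top-level category, and a child model is consulted only if one is registered for that category. This shows the control flow only; the toy models below are hypothetical stand-ins that read precomputed scores, not the deep networks used in HMIC.

```python
def hierarchical_predict(sample, parent_model, child_models):
    """Return (parent_label, child_label); child_label is None when the
    predicted parent class has no child-level model."""
    parent_label = parent_model(sample)
    child_model = child_models.get(parent_label)
    child_label = child_model(sample) if child_model is not None else None
    return parent_label, child_label

def toy_parent(sample):
    """Hypothetical parent model: picks the highest precomputed score."""
    scores = sample["parent_scores"]
    return max(scores, key=scores.get)

def toy_severity(sample):
    """Hypothetical child model for Celiac Disease severity."""
    scores = sample["severity_scores"]
    return max(scores, key=scores.get)

children = {"Celiac Disease": toy_severity}
biopsy = {
    "parent_scores": {"Celiac Disease": 0.7, "Environmental Enteropathy": 0.2, "Normal": 0.1},
    "severity_scores": {"I": 0.1, "IIIa": 0.6, "IIIb": 0.2, "IIIc": 0.1},
}
print(hierarchical_predict(biopsy, toy_parent, children))
```

Compared with a flat multi-class classifier over all leaf labels, this layout lets each child model specialize on distinctions that only matter within one parent class.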
An optimizable scalar objective value cannot be objective and should not be the sole objective.
EN: This paper concerns the ethics and morality of algorithms and computational systems, and has been circulating internally at Facebook for the past couple years. The paper reviews many Nobel laureates' work, as well as the work of other prominent scientists such as Richard Dawkins, Andrei Kolmogorov, Vilfredo Pareto, and John von Neumann. The paper draws conclusions based on such works, as summarized in the title. The paper argues that the standard approach to modern machine learning and artificial intelligence is bound to be biased and unfair, and that longstanding traditions in the professions of law, justice, politics, and medicine should help.
In silico identification of potential natural product inhibitors of human proteases key to SARS-CoV-2 infection.
EN: Presently, there are no approved drugs or vaccines to treat COVID-19, which has spread to over 200 countries and is responsible for over 365,000 deaths worldwide. Recent studies have shown that two human proteases, TMPRSS2 and cathepsin L, play a key role in host cell entry of SARS-CoV-2. Importantly, inhibitors of these proteases were shown to block SARS-CoV-2 infection. Here, we perform virtual screening of 14010 phytochemicals produced by Indian medicinal plants to identify natural product inhibitors of TMPRSS2 and cathepsin L. We built a homology model of TMPRSS2, as an experimentally determined structure is not available. AutoDock Vina was used to perform molecular docking of phytochemicals against the TMPRSS2 model structure and the cathepsin L crystal structure. Potential phytochemical inhibitors were filtered by comparing their docked binding energies with those of known inhibitors of TMPRSS2 and cathepsin L. Further, the ligand binding site residues and non-covalent protein-ligand interactions were used as an additional filter to identify phytochemical inhibitors that either bind to or form interactions with residues important for the specificity of the target proteases. We have id...
In Silico Investigation of Phytoconstituents from Indian Medicinal Herb 'Tinospora cordifolia (Giloy)' against SARS-CoV-2 (COVID-19) by Molecular Dynamics Approach.
EN: The recent appearance of COVID-19 virus has created a global crisis due to unavailability of any vaccine or drug that can effectively and deterministically work against it. Naturally, different possibilities (including herbal medicines having known therapeutic significance) have been explored by the scientists. The systematic scientific study (beginning with in silico study) of herbal medicines in particular and any drug in general is now possible as the structural components (proteins) of COVID-19 are already characterized. The main protease of COVID-19 virus is Mpro or 3CLpro which is a key CoV enzyme and an attractive drug target as it plays a pivotal role in mediating viral replication and transcription. In the present study, 3CLpro is used to study drug:3CLpro interactions and thus to investigate whether all or any of the main chemical constituents of Tinospora cordifolia (e.g., berberine (C₂₀H₁₈NO₄), β-sitosterol (C₂₉H₅₀O), choline (C₅H₁₄NO), tetrahydropalmatine (C₂₁H₂₅NO₄) and octacosanol (C₂₈H₅₈O)) can be used as an anti-viral drug against SARS-CoV-2. The in silico study p...
Protein-ligand interaction study to identify potential dietary compounds binding at the active site of therapeutic target proteins of SARS-CoV-2.
EN: Objective: A total of 186 biologically important phenylpropanoid and polyketide compounds from different Indian medicinal plants and dietary sources were screened to filter potential compounds that bind at the active site of the therapeutic target proteins of SARS-CoV-2. Method: The molecular docking studies were carried out using AutoDock Vina. The in silico ADMET and drug-likeness properties of the compounds were predicted with the SwissADME server. Result: The molecular docking study of the 186 compounds with the therapeutic target proteins (Mpro, PLpro, RdRp, SGp and ACE2) of SARS-CoV-2 resulted in 40 compounds that bind at the active site with dock score above -8.0 kcal/mol. Conclusion: Based on the in silico ADMET study and drug-likeness prediction of the 40 compounds, we propose petunidin, baicalein, cyanidin, 7-hydroxy-3',4'-methylenedioxyflavan, quercetin and ellagic acid among the 186 biologically important phenylpropanoids and polyketides as potential lead compounds, which can be further investigated pharmacologically and clinically to formulate therapeutic approaches for COVID-19.
The Skincare project, an interactive deep learning system for differential diagnosis of malignant skin lesions. Technical Report.
EN: A shortage of dermatologists causes long wait times for patients who seek dermatologic care. In addition, the diagnostic accuracy of general practitioners has been reported to be lower than the accuracy of artificial intelligence software. This article describes the Skincare project (H2020, EIT Digital). Contributions include enabling technology for clinical decision support based on interactive machine learning (IML), a reference architecture towards a Digital European Healthcare Infrastructure (also cf. EIT MCPS), technical components for aggregating digitised patient information, and the integration of decision support technology into clinical test-bed environments. However, the main contribution is a diagnostic and decision support system in dermatology for patients and doctors, an interactive deep learning system for differential diagnosis of malignant skin lesions. In this article, we describe its functionalities and the user interfaces to facilitate machine learning from human input. The baseline deep learning system, which delivers state-of-the-art results and the potential to augment general practitioners and even dermatologists, was developed and validated using de-identi...
Spontaneous Formation of Double Emulsions at Particle-Laden Interfaces.
EN: Double emulsions, due to their compartmental structures, are essential in food, agricultural, and pharmaceutical applications. Traditionally, double emulsifications rely on the presence of both oil-soluble and water-soluble surfactants or external stimuli responsive materials and require sequential droplet formation settings or unique fluidic designs. We report on an unusual phenomenon where double emulsions are spontaneously formed as soon as an aqueous nanoparticle dispersion is placed in contact with an oleic micellar solution. Nanoscale water droplets nucleate in oil in the form of swollen micelles. Nanoparticles form a water-shell encapsulating the saturated oil phase with swollen micelles over time. Remarkably, we find that the gradual surface-activation of nanoparticles is key in self-double emulsification and controlling the emulsion intensity. We build on this new discovery and design a novel system for double emulsion formation. This approach is a scalable self-sequential strategy for preparing core-shell double emulsions that disperses nanoparticles in the opposite phase by employing micelles as transport vehicles. Incorporating nanoparticles into spontaneous emulsification sy...
In silico ADMET and molecular docking study on searching potential inhibitors from limonoids and triterpenoids for COVID-19.
EN: Virtual screening of phytochemicals was performed through molecular docking, simulation, in silico ADMET and drug-likeness prediction to identify the potential hits that can inhibit the effects of SARS-CoV-2. Considering the published literature on medicinal importance, total 154 phytochemicals with analogous structure from limonoids and triterpenoids were selected to search potential inhibitors for the five therapeutic protein targets of SARS-CoV-2, i.e., 3CLpro (main protease), PLpro (papain-like protease), SGp-RBD (spike glycoprotein-receptor binding domain), RdRp (RNA dependent RNA polymerase) and ACE2 (angiotensin-converting enzyme 2). The in silico computational results revealed that the phytochemicals such as glycyrrhizic acid, limonin, 7-deacetyl-7-benzoylgedunin, maslinic acid, corosolic acid, obacunone and ursolic acid were found to be effective against the target proteins of SARS-CoV-2. The protein-ligand interaction study revealed that these phytochemicals bind with the amino acid residues at the active site of the target proteins. Therefore, the core structure of these potential hits can be used for further lead optimization to design drugs for SARS-CoV-2. Also, the me...
BIOMRC: A Dataset for Biomedical Machine Reading Comprehension.
EN: We introduce BIOMRC, a large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018). Experiments show that simple heuristics do not perform well on the new dataset, and that two neural MRC models that had been tested on BIOREAD perform much better on BIOMRC, indicating that the new dataset is indeed less noisy or at least that its task is more feasible. Non-expert human performance is also higher on the new dataset compared to BIOREAD, and biomedical experts perform even better. We also introduce a new BERT-based MRC model, the best version of which substantially outperforms all other methods tested, reaching or surpassing the accuracy of biomedical experts in some experiments. We make the new dataset available in three different sizes, also releasing our code, and providing a leaderboard.
COVID-19Base: A knowledgebase to explore biomedical entities related to COVID-19.
EN: We present COVID-19Base, a knowledgebase highlighting the biomedical entities related to COVID-19 disease based on literature mining. To develop COVID-19Base, we mine information from publicly available scientific literature and related public resources. Seven topic-specific dictionaries, including human genes, human miRNAs, human lncRNAs, diseases, Protein Databank, drugs, and drug side effects, are integrated to mine all scientific evidence related to COVID-19. We have employed an automated literature mining and labeling system through a novel approach to measure the effectiveness of drugs against diseases based on natural language processing, sentiment analysis, and deep learning. To the best of our knowledge, this is the first knowledgebase dedicated to COVID-19 that integrates such a large variety of related biomedical entities through literature mining. Proper investigation of the mined biomedical entities, along with the identified interactions among them reported in COVID-19Base, would help the research community to discover possible ways for the therapeutic treatment of COVID-19.
SkeleDock: A Web Application for Scaffold Docking in PlayMolecule.
EN: SkeleDock is a scaffold docking algorithm which uses the structure of a protein-ligand complex as a template to model the binding mode of a chemically similar system. This algorithm was evaluated in the D3R Grand Challenge 4 pose prediction challenge, where it achieved competitive performance. Furthermore, we show that, if crystallized fragments of the target ligand are available, SkeleDock can outperform rDock docking software at predicting the binding mode. This article also addresses the capacity of this algorithm to model macrocycles and deal with scaffold hopping. SkeleDock can be accessed at https://playmolecule.org/SkeleDock/.
Evaluating Sparse Interpretable Word Embeddings for Biomedical Domain.
EN: Word embeddings have found their way into a wide range of natural language processing tasks including those in the biomedical domain. While these vector representations successfully capture semantic and syntactic word relations, hidden patterns and trends in the data, they fail to offer interpretability. Interpretability is a key means to justification which is an integral part when it comes to biomedical applications. We present an inclusive study on interpretability of word embeddings in the medical domain, focusing on the role of sparse methods. Qualitative and quantitative measurements and metrics for interpretability of word vector representations are provided. For the quantitative evaluation, we introduce an extensive categorized dataset that can be used to quantify interpretability based on category theory. Intrinsic and extrinsic evaluation of the studied methods are also presented. As for the latter, we propose datasets which can be utilized for effective extrinsic evaluation of word vectors in the biomedical domain. Based on our experiments, it is seen that sparse word vectors show far more interpretability while preserving the performance of their original vectors in dow...
Strategic Spatiotemporal Vaccine Distribution Increases the Survival Rate in an Infectious Disease like Covid-19.
EN: Covid-19 has caused hundreds of thousands of deaths and economic damage amounting to trillions of dollars, creating a desire for the rapid development of a vaccine. Once available, a vaccine is produced gradually, raising the question of how best to distribute it. While official vaccination guidelines largely focus on the question of to whom vaccines should be provided first (e.g. to risk groups), here we propose a strategy for their distribution in time and space, which sequentially prioritizes regions with a high local infection growth rate. To demonstrate this strategy, we develop a simple statistical model describing the time-evolution of infection patterns and their response to vaccination, for infectious diseases like Covid-19. For inhomogeneous infection patterns, locally well-mixed populations and basic reproduction numbers $R_0\sim 1.5-4$ the proposed strategy at least halves the number of deaths in our simulations compared to the standard practice of distributing vaccines proportionally to the population density. For $R_0\sim 1$ we still find a significant increase of the survival rate. The proposed vaccine distribution strategy can be further tested in detailed modelling work...
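The prioritization idea above can be illustrated with a minimal sketch. It is only one possible reading of "sequentially prioritizes regions with a high local infection growth rate", not the authors' model: the function name allocate_doses, the greedy rule, and all region data below are hypothetical.

```python
def allocate_doses(regions, doses):
    """Greedily assign doses to regions in order of decreasing infection growth rate.

    regions: list of (name, growth_rate, unvaccinated_population) tuples
    doses: number of doses available in this time step
    Returns a dict mapping region name to the number of doses assigned.
    """
    plan = {}
    remaining = doses
    # Highest local growth rate first (assumed priority rule).
    for name, growth_rate, population in sorted(regions, key=lambda r: r[1], reverse=True):
        given = min(remaining, population)
        plan[name] = given
        remaining -= given
    return plan


# Hypothetical regions: (name, local growth rate per day, unvaccinated people).
regions = [("A", 0.02, 500), ("B", 0.15, 300), ("C", 0.08, 400)]
plan = allocate_doses(regions, 600)
print(plan)
```

In a time-stepped simulation, growth rates would be re-estimated and the allocation repeated at every step; a population-proportional baseline would instead split the doses by each region's share of the population.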
On Interpretability of Deep Learning based Skin Lesion Classifiers using Concept Activation Vectors.
EN: Deep learning based medical image classifiers have shown remarkable prowess in various application areas like ophthalmology, dermatology, pathology, and radiology. However, the acceptance of these Computer-Aided Diagnosis (CAD) systems in real clinical setups is severely limited, primarily because their decision-making process remains largely obscure. This work aims at elucidating a deep learning based medical image classifier by verifying that the model learns and utilizes similar disease-related concepts as described and employed by dermatologists. We used a well-trained and high performing neural network developed by the REasoning for COmplex Data (RECOD) Lab for classification of three skin tumours, i.e. Melanocytic Naevi, Melanoma and Seborrheic Keratosis, and performed a detailed analysis on its latent space. Two well established and publicly available skin disease datasets, PH2 and derm7pt, are used for experimentation. Human understandable concepts are mapped to the RECOD image classification model with the help of Concept Activation Vectors (CAVs), introducing a novel training and significance testing paradigm for CAVs. Our results on an independent evaluation set clearly show that...
Calculation of light transmittance in a film: considerations of the coating geometry, the agent distribution, and its probability density distribution.
EN: Transmittance is an important parameter for various films such as sunscreen films and creams, biofilms, coating materials, etc. Even if amounts of a sunscreen agent are the same, the transmittance greatly changes depending on the coating geometry (CG) and the agent distribution (AD) in the film. In this study, we calculate the transmittance considering CG and AD. In addition, we associate the transmittance with probability density distribution of the thickness of the film. We found analytical and numerical solutions of the transmittance in several model cases. It can be used for prediction of performance of the sunscreen film and for a fair comparative evaluation. Mathematical techniques in calculation of the transmittance are also explained in detail.
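To illustrate why the same amount of agent can give different transmittance, here is a minimal sketch. It assumes Beer-Lambert attenuation, T(h) = exp(-alpha * h), averaged over a discrete thickness distribution; the abstract does not state this model explicitly, and the function name, the coefficient alpha, and the thickness values are hypothetical.

```python
import math


def mean_transmittance(thicknesses, weights, alpha):
    """Area-weighted transmittance of a film with a discrete thickness distribution.

    thicknesses: film thickness values (arbitrary length unit)
    weights: fraction of the coated area at each thickness (must sum to 1)
    alpha: assumed attenuation coefficient (inverse length unit)
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * math.exp(-alpha * h) for h, w in zip(thicknesses, weights))


alpha = 1.0  # hypothetical attenuation coefficient

# Uniform film: the whole area has thickness 2.
uniform = mean_transmittance([2.0], [1.0], alpha)

# Uneven film with the same mean thickness: half the area at 1, half at 3.
uneven = mean_transmittance([1.0, 3.0], [0.5, 0.5], alpha)

print(f"uniform: {uniform:.4f}, uneven: {uneven:.4f}")
```

Under this assumption the uneven film transmits more light than the uniform one although both contain the same amount of agent, because the exponential is convex (Jensen's inequality).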
Biomedical Entity Representations with Synonym Marginalization.
EN: Biomedical named entities often play important roles in many biomedical text mining tools. However, due to the incompleteness of provided synonyms and numerous variations in their surface forms, normalization of biomedical entities is very challenging. In this paper, we focus on learning representations of biomedical entities solely based on the synonyms of entities. To learn from the incomplete synonyms, we use a model-based candidate selection and maximize the marginal likelihood of the synonyms present in top candidates. Our model-based candidates are iteratively updated to contain more difficult negative samples as our model evolves. In this way, we avoid the explicit pre-selection of negative samples from more than 400K candidates. On four biomedical entity normalization datasets having three different entity types (disease, chemical, adverse reaction), our model BioSyn consistently outperforms previous state-of-the-art models almost reaching the upper bound on each dataset.
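The idea of maximizing the marginal likelihood of synonyms among the top candidates can be sketched as follows. This is a minimal illustration under assumptions, not the BioSyn implementation: the scores stand in for mention-candidate similarities from some encoder, and the function name and all numbers are hypothetical.

```python
import math


def marginal_nll(scores, is_synonym):
    """Negative log of the total softmax probability assigned to synonym candidates.

    scores: similarity score of the mention to each top candidate
    is_synonym: True where the candidate is a synonym of the gold entity
    """
    if not any(is_synonym):
        raise ValueError("at least one candidate must be a synonym")
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    positive = sum(e for e, pos in zip(exps, is_synonym) if pos)
    return -math.log(positive / total)


# Hypothetical top-4 candidates for one mention; two of them are synonyms.
scores = [2.0, 1.0, 0.5, -1.0]
is_synonym = [True, False, True, False]
loss = marginal_nll(scores, is_synonym)
print(f"loss: {loss:.4f}")
```

Summing the probabilities of all synonym candidates before taking the logarithm means the model is not forced to rank one particular synonym first; the loss falls whenever probability mass moves from non-synonyms to any synonym.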
Molecular dynamics study of the competitive binding of hydrogen peroxide and water molecules with the DNA phosphate groups.
EN: Hydrogen peroxide is present in living cells at small concentrations that increase under the action of heavy ion beams during anticancer therapy. The interactions of hydrogen peroxide with DNA, proteins and other biological molecules are poorly understood. In the present work the competitive binding of hydrogen peroxide and water molecules with the DNA double helix backbone has been studied using the molecular dynamics method. The simulations have been carried out for the DNA double helix in a water solution with hydrogen peroxide molecules and Na$^{+}$ counterions. The obtained radial distribution functions of counterions, H$_2$O$_2$ and H$_2$O molecules with respect to the oxygen atoms of DNA phosphate groups have been used for the analysis of the formation of different complexes. The calculated mean residence times show that a hydrogen peroxide molecule stays at least twice as long near the phosphate group (up to 7 ps) as a water molecule (about 3 ps). The hydrogen peroxide molecules form more stable complexes with the phosphate groups of the DNA backbone than water molecules do.
Crystal structures of Fe-gluconate.
EN: Fe-gluconate, Fe(C$_6$H$_{11}$O$_7$)$_2\cdot$xH$_2$O, is a well-known material widely used for iron supplementation. It is also used in the food industry as a coloring agent, in the cosmetic industry for skin and nail conditioning, and in metallurgy. Despite this wide range of applications, its physical properties have not been studied extensively. In this study, Fe-gluconate with three different amounts of water, viz. x=2 (fully hydrated), 0 < x < 2 (intermediate) and x=0 (dry), was investigated by means of X-ray diffraction (XRD) and Mössbauer spectroscopic (MS) methods, the former in the temperature range of 20-300 K and the latter at 295 K. Based on the XRD measurements, crystallographic structures were determined: monoclinic (space group I2) for the hydrated sample and triclinic (space group P1) for the dry sample. The partially hydrated sample was two-phased. Unit cell parameters for both structures show strong, very complex and non-monotonic temperature dependences. Mössbauer spectroscopic measurements gave evidence that iron in all samples exists in the form of Fe(II) and Fe(III) ions. The amount of the latter equals ca. 30% in the hydrated sample and ca. 20% in the dry one.
Alleviating the Incompatibility between Cross Entropy Loss and Episode Training for Few-shot Skin Disease Classification.
EN: Skin disease classification from images is crucial to dermatological diagnosis. However, identifying skin lesions involves a variety of aspects in terms of size, color, shape, and texture. To make matters worse, many categories only contain very few samples, posing great challenges to conventional machine learning algorithms and even human experts. Inspired by the recent success of Few-Shot Learning (FSL) in natural image classification, we propose to apply FSL to skin disease identification to address the extreme scarcity of training sample problem. However, directly applying FSL to this task does not work well in practice, and we find that the problem can be largely attributed to the incompatibility between Cross Entropy (CE) and episode training, which are both commonly used in FSL. Based on a detailed analysis, we propose the Query-Relative (QR) loss, which proves superior to CE under episode training and is closely related to recently proposed mutual information estimation. Moreover, we further strengthen the proposed QR loss with a novel adaptive hard margin strategy. Comprehensive experiments validate the effectiveness of the proposed FSL scheme and the possibility to diagno...
Lindemann unjamming of emulsions.
EN: We study the bulk and shear elastic properties of barely-compressed, "athermal" emulsions and find that the rigidity of the jammed solid fails at remarkably large critical osmotic pressures. The minuscule yield strain and similarly small Brownian particle displacement of solid emulsions close to this transition suggest that this catastrophic failure corresponds to a plastic-entropic instability: the solid becomes too soft and weak to resist the thermal agitation of the droplets that compose it and fails. We propose a modified Lindemann stability criterion to describe this transition and derive a scaling law for the critical osmotic pressure that agrees quantitatively with experimental observations.
Searching inhibitors for three important proteins of COVID-19 through molecular docking studies.
EN: The lack of recommended drugs or vaccines to deal with COVID-19 is the main concern of this pandemic. Approved drugs for similar health problems, drugs under clinical trials, and molecules from medicinal plant extracts are being investigated randomly to deal with COVID-19 infection. Molecular docking is one of the best approaches to search for therapeutically potent drugs/molecules in real time, with possible hope of application to COVID-19. In this communication, molecular docking studies of 18 ligands were carried out with three therapeutic target proteins of SARS-CoV-2, i.e., RNA-dependent RNA polymerase (RdRp), angiotensin-converting enzyme 2 (ACE2) and spike glycoprotein (SGp). The obtained results revealed that the phytochemicals showed better dock scores compared to the drugs paracetamol and hydroxychloroquine. Combining the dock score and medicinal properties, we believe the terpenoid-based phytochemicals limonin and scopadulcic acid B can be further explored for potential use against COVID-19.
Catastrophic thermal destabilization of two-dimensional close-packed emulsions due to synchronized coalescence initiation.
EN: The mechanisms for phase separation in highly concentrated emulsions when subjected to a thermal phase transition remain to be elucidated. Here, we create a hexagonally close-packed monodisperse emulsion in 2D and show that during a cool-heat cycle, the emulsion fully destabilizes akin to phase separation. The mechanism for this catastrophic destabilization is found to be spontaneous coalescence initiation that synchronously occurs between every solidified droplet and its neighbors. This synchronous coalescence initiation establishes system-wide network connectivity in the emulsion causing large-scale destabilization. This system-wide coalescence initiation is found to be insensitive to droplet size and surfactant type, but dependent on network connectivity and crystal content of individual droplets.
Multilingual enrichment of disease biomedical ontologies.
EN: Translating biomedical ontologies is an important challenge, but doing it manually requires much time and money. We study the possibility to use open-source knowledge bases to translate biomedical ontologies. We focus on two aspects: coverage and quality. We look at the coverage of two biomedical ontologies focusing on diseases with respect to Wikidata for 9 European languages (Czech, Dutch, English, French, German, Italian, Polish, Portuguese and Spanish) for both ontologies, plus Arabic, Chinese and Russian for the second one. We first use direct links between Wikidata and the studied ontologies and then use second-order links by going through other intermediate ontologies. We then compare the quality of the translations obtained thanks to Wikidata with a commercial machine translation tool, here Google Cloud Translation.
NiLBS: Neural Inverse Linear Blend Skinning.
EN: In this technical report, we investigate efficient representations of articulated objects (e.g. human bodies), which is an important problem in computer vision and graphics. To deform articulated geometry, existing approaches represent objects as meshes and deform them using "skinning" techniques. The skinning operation allows a wide range of deformations to be achieved with a small number of control parameters. This paper introduces a method to invert the deformations undergone via traditional skinning techniques via a neural network parameterized by pose. The ability to invert these deformations allows values (e.g., distance function, signed distance function, occupancy) to be pre-computed at rest pose, and then efficiently queried when the character is deformed. We leave empirical evaluation of our approach to future work.
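For readers unfamiliar with the forward operation that the report proposes to invert, here is a minimal 2D sketch of linear blend skinning: a rest-pose vertex is moved to the weighted sum of its positions under each bone's rigid transform. The function name, the 2D setting, and the example bones and weights are hypothetical, and the neural inverse described in the abstract is not shown.

```python
import math


def lbs_2d(vertex, weights, transforms):
    """Deform a 2D rest-pose vertex with linear blend skinning.

    vertex: (x, y) position at rest pose
    weights: per-bone skinning weights for this vertex (should sum to 1)
    transforms: list of (angle_radians, (tx, ty)) rigid bone transforms
    """
    x, y = vertex
    out_x = out_y = 0.0
    for w, (angle, (tx, ty)) in zip(weights, transforms):
        c, s = math.cos(angle), math.sin(angle)
        # Apply this bone's rotation and translation, then blend by its weight.
        out_x += w * (c * x - s * y + tx)
        out_y += w * (s * x + c * y + ty)
    return out_x, out_y


# Hypothetical example: one bone stays at rest, the other rotates by 90 degrees.
bones = [(0.0, (0.0, 0.0)), (math.pi / 2, (0.0, 0.0))]
px, py = lbs_2d((1.0, 0.0), [0.5, 0.5], bones)
print(f"deformed vertex: ({px:.3f}, {py:.3f})")
```

Only the pose (the bone transforms) changes between frames while the weights stay fixed, which is why a handful of control parameters can drive a whole mesh. In this example the blended point lies closer to the origin than the rest-pose vertex, an instance of the shrinkage that linear blending of rotations can produce.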
CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models.
EN: The novel nature of SARS-CoV-2 calls for the development of efficient de novo drug design approaches. In this study, we propose an end-to-end framework, named CogMol (Controlled Generation of Molecules), for designing new drug-like small molecules targeting novel viral proteins with high affinity and off-target selectivity. CogMol combines adaptive pre-training of a molecular SMILES Variational Autoencoder (VAE) and an efficient multi-attribute controlled sampling scheme that uses guidance from attribute predictors trained on latent features. To generate novel and optimal drug-like molecules for unseen viral targets, CogMol leverages a protein-molecule binding affinity predictor that is trained using SMILES VAE embeddings and protein sequence embeddings learned unsupervised from a large corpus. The CogMol framework is applied to three SARS-CoV-2 target proteins: main protease, receptor-binding domain of the spike protein, and non-structural protein 9 replicase. The generated candidates are novel at both molecular and chemical scaffold levels when compared to the training data. CogMol also includes in silico screening for assessing toxicity of parent molecules and their metabolites with ...
DeepGS: Deep Representation Learning of Graphs and Sequences for Drug-Target Binding Affinity Prediction.
EN: Accurately predicting drug-target binding affinity (DTA) in silico is a key task in drug discovery. Most of the conventional DTA prediction methods are simulation-based, which rely heavily on domain knowledge or the assumption of having the 3D structure of the targets, which are often difficult to obtain. Meanwhile, traditional machine learning-based methods apply various features and descriptors, and simply depend on the similarities between drug-target pairs. Recently, with the increasing amount of affinity data available and the success of deep representation learning models on various domains, deep learning techniques have been applied to DTA prediction. However, these methods consider either label/one-hot encodings or the topological structure of molecules, without considering the local chemical context of amino acids and SMILES sequences. Motivated by this, we propose a novel end-to-end learning framework, called DeepGS, which uses deep neural networks to extract the local chemical context from amino acids and SMILES sequences, as well as the molecular structure from the drugs. To assist the operations on the symbolic data, we propose to use advanced embedding techniques (i.e...
Apollonian Emulsions.
EN: We have discovered the existence of extremely polydisperse High Internal-Phase-Ratio Emulsions (HIPE) in which the internal-phase droplets, present at 95% volume fraction, remain spherical and organize themselves in the available space according to Apollonian packing rules. Such Apollonian emulsions are obtained from dispersing oil dropwise in water in the presence of very little surfactant, and allowing them to evolve at rest for a week. The packing structure of the droplets was confirmed through size distribution measurements that evolved spontaneously towards power laws with the known Apollonian exponents, as well as comparison of the structure factors of aged HIPEs measured by Small-Angle X-ray Scattering with that of a numerically simulated Random Apollonian Packing. Thanks to the perfect sphericity of the droplets, Apollonian emulsions were found to display Newtonian flow even at such extremely high volume fraction. We argue that these fascinating space-filling assemblies of spherical droplets are a result of coalescence and fragmentation processes obeying simple geometrical rules of conserving total volume and sphericity, minimizing the elastic energy associated with interacti...
Bimolecular binding rates for pairs of spherical molecules with small binding sites.
EN: Bimolecular binding rate constants are often used to describe the association of large molecules, such as proteins. In this paper, we analyze a model for such binding rates that includes the fact that pairs of molecules can bind only in certain orientations. The model considers two spherical molecules, each with an arbitrary number of small binding sites on their surface, and the two molecules bind if and only if their binding sites come into contact (such molecules are often called "patchy particles" in the biochemistry literature). The molecules undergo translational and rotational diffusion, and the binding sites are allowed to diffuse on their surfaces. Mathematically, the model takes the form of a high-dimensional, anisotropic diffusion equation with mixed boundary conditions. We apply matched asymptotic analysis to derive the bimolecular binding rate in the limit of small, well-separated binding sites. The resulting binding rate formula involves a factor that depends on the electrostatic capacitance of a certain four-dimensional region embedded in five dimensions. We compute this factor numerically by modifying a recent kinetic Monte Carlo algorithm. We then apply a quasi che...
Flowing emulsions through disorder: Critical depinning and smectic rivers.
EN: During the past sixty minutes alone, oil companies have extracted six trillion liters of oil from the ground, i.e. the volume of about two hundred Olympic swimming pools. This phenomenal number gives a striking illustration of the impact of multiphase flows on the world economy and environment. From a fundamental perspective, we now clearly understand the large-scale patterns formed when liquid interfaces are driven through heterogeneous environments. In stark contrast, the displacement of fragmented fluids through disordered media remains limited to isolated droplets and bubbles. Here, we elucidate the collective dynamics of emulsions hydrodynamically driven through disordered environments. Advecting hundreds of thousands of microfluidic droplets through random lattices of pinning sites, we establish that the mobilization of confined emulsions is a critical dynamical transition. Unlike contact-line depinning, emulsion mobilization is not triggered by large-scale avalanches but merely requires the coordinated motion of small groups of particles. Criticality arises from the correlations of seemingly erratic depinning events over system-spanning scales along smectic river networks. ...
Syndrome-aware Herb Recommendation with Multi-Graph Convolution Network.
EN: Herb recommendation plays a crucial role in the therapeutic process of Traditional Chinese Medicine (TCM), which aims to recommend a set of herbs to treat the symptoms of a patient. While several machine learning methods have been developed for herb recommendation, they are limited to modeling only the interactions between herbs and symptoms, ignoring the intermediate process of syndrome induction. When performing TCM diagnostics, an experienced doctor typically induces syndromes from the patient's symptoms and then suggests herbs based on the induced syndromes. As such, we believe the induction of syndromes, an overall description of the symptoms, is important for herb recommendation and should be properly handled. However, due to the ambiguity and complexity of syndrome induction, most prescriptions lack the explicit ground truth of syndromes. In this paper, we propose a new method that takes the implicit syndrome induction process into account for herb recommendation. Given a set of symptoms to treat, we aim to generate an overall syndrome representation by effectively fusing the embeddings of all the symptoms in the set, to mimic how a doctor induces the syndromes. Towards s...
CBAG: Conditional Biomedical Abstract Generation.
EN: Biomedical research papers use significantly different language and jargon when compared to typical English text, which reduces the utility of pre-trained NLP models in this domain. Meanwhile, Medline, a database of biomedical abstracts, introduces nearly a million new documents per year. Applications that could benefit from understanding this wealth of publicly available information, such as scientific writing assistants, chat-bots, or descriptive hypothesis generation systems, require new domain-centered approaches. A conditional language model, one that learns the probability of words given some a priori criteria, is a fundamental building block in many such applications. We propose a transformer-based conditional language model with a shallow encoder "condition" stack, and a deep "language model" stack of multi-headed attention blocks. The condition stack encodes metadata used to alter the output probability distribution of the language model stack. We sample this distribution in order to generate biomedical abstracts given only a proposed title, an intended publication year, and a set of keywords. Using typical natural language generation metrics, we demonstrate that this propo...
A Survey on Causal Inference.
EN: Causal inference has been a critical research topic across many domains, such as statistics, computer science, education, public policy and economics, for decades. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and low budget requirement, compared with randomized controlled trials. Driven by the rapidly developing machine learning area, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well-known causal inference frameworks. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework or not. For each category, both the traditional statistical methods and the recent machine learning enhanced methods are discussed and compared. The plausible applications of these methods are also presented, including the applications in advertising, recommendation, medicine and so on. Moreover, the commonly used benchmark datasets as well as the open-source codes are also summar...
Origin of the extremely high elasticity of bulk emulsions, stabilized by Yucca Schidigera saponins.
EN: We found experimentally that the elasticity of sunflower oil-in-water emulsions (SFO-in-W) stabilized by Yucca Schidigera Roezl saponin extract is more than 50 times higher than the elasticity of common emulsions. We revealed that strong specific interactions between the phytosterols from the non-purified oil and the saponins from the Yucca extract lead to the formation of nanostructured adsorption layers which are responsible for the very high elasticity of the oil-water interface and of the respective bulk emulsions. Remarkably, this extra high emulsion elasticity inhibits the emulsion syneresis even at 65 vol % of the oil drops. These emulsions remain homogeneous and stable even after 30 days of shelf-storage. These results demonstrate that the combination of saponin and phytosterols is a powerful new approach to structure oil-in-water emulsions with potential applications for formulating healthier functional food.
Molecular Asymmetry and Optical Cycling: Laser Cooling Asymmetric Top Molecules.
EN: We present a practical roadmap to achieve optical cycling and laser cooling of asymmetric top molecules (ATMs). Our theoretical analysis describes how reduced molecular symmetry, as compared to diatomic and symmetric non-linear molecules, plays a role in photon scattering. We present methods to circumvent limitations on rapid photon cycling in these systems. We calculate vibrational branching ratios for a diverse set of asymmetric top molecules and find that many species within a broad class of molecules can be effectively cooled with a manageable number of lasers. We also describe methods to achieve rotationally closed optical cycles in ATMs. Despite significant structural complexity, laser cooling can be made effective using extensions of the current techniques used for linear molecules. Potential scientific impacts of laser-cooled ATMs span frontiers in controlled chemistry, quantum simulation, and searches for physics beyond the Standard Model.
On the Polarization of Ligands by Proteins.
EN: Although ligand-binding sites in many proteins contain a high number density of charged side chains that can polarize small organic molecules and influence binding, the magnitude of this effect has not been studied in many systems. Here, we use a quantum mechanics/molecular mechanics (QM/MM) approach in which the ligand is the QM region to compute the ligand polarization energy of 286 protein-ligand complexes from the PDBBind Core Set (release 2016). We observe that the ligand polarization energy is linearly correlated with the magnitude of the electric field acting on the ligand, the magnitude of the induced dipole moment, and the classical polarization energy. The influence of protein and cation charges on the ligand polarization diminishes with the distance and is below 2 kcal/mol at 9 Å and 1 kcal/mol at 12 Å. Considering both polarization and solvation appears essential to computing negative binding energies in some crystallographic complexes. Solvation, but not polarization, is essential for achieving moderate correlation with experimental binding free energies.
Dietary Restriction of Amino Acids for Cancer Therapy.
EN: Biosynthesis of proteins, nucleotides and fatty acids is essential for the malignant proliferation and survival of cancer cells. Accumulating research findings show that amino acid restrictions are potential strategies for cancer interventions. Meanwhile, dietary strategies are popular among cancer patients. However, a solid rationale clarifying which strategy is best, and why and how it works, is still lacking. Here, integrated analyses and comprehensive summaries of the abundances, signalling and functions of amino acids in proteomes, metabolism, immunity and food compositions suggest that intermittent fasting, or intermittent dietary lysine restriction with normal maize as an intermittent staple food for days or weeks, might have value and potential for cancer prevention or therapy. Moreover, dietary supplements for cancer cachexia, including immunomodulatory supplements, are also discussed.
Assessing Robustness of Deep learning Methods in Dermatological Workflow.
EN: This paper aims to evaluate the suitability of current deep learning methods for clinical workflows, focusing on dermatology. Although deep learning methods have been applied in attempts to reach dermatologist-level accuracy on several individual conditions, they have not been rigorously tested on common clinical complaints. Most projects involve data acquired in well-controlled laboratory conditions. This may not reflect regular clinical evaluation, where image quality is not always ideal. We test the robustness of deep learning methods by simulating non-ideal characteristics on user-submitted images of ten classes of diseases. Under these simulated conditions, we found that overall accuracy drops and individual predictions change significantly in many cases, despite robust training.
Controlled transitions between phyllotactic states of repulsive particles confined on the surface of a cylinder.
EN: Phyllotactic states are regular lattice-like structures on cylinders and are a botanical classification scheme. In this communication, we report a sequence of transitions between phyllotactic states for particles with a repulsive particle-particle interaction on a cylindrical geometry at zero temperature. We can infer the transition points as a function of density via Monte Carlo simulations, as well as the mathematical descriptions of the ground states. The lattices we generate are described as phyllotactic states that fit onto the cylindrical surface as a set of helical chains. Our analysis shows how all state energies lie on the same parabola which we exploit to find the transitions.
Energy-based Graph Convolutional Networks for Scoring Protein Docking Models.
EN: Structural information about protein-protein interactions, often missing at the interactome scale, is important for mechanistic understanding of cells and rational discovery of therapeutics. Protein docking provides a computational alternative to predict such information. However, ranking near-native docked models high among a large number of candidates, often known as the scoring problem, remains a critical challenge. Moreover, estimating model quality, also known as the quality assessment problem, is rarely addressed in protein docking. In this study, the two challenging problems in protein docking are regarded as relative and absolute scoring, respectively, and addressed in one physics-inspired deep learning framework. We represent proteins' and encounter complexes' 3D structures as intra- and inter-molecular residue contact graphs with atom-resolution node and edge features. We also propose a novel graph convolutional kernel that pools interacting nodes' features through edge features so that generalized interaction energies can be learned directly from graph data. The resulting energy-based graph convolutional networks (EGCN) with multi-head attention are trained to predict int...
Role of interfacial elasticity for the rheological properties of saponin-stabilized emulsions.
EN: Hypothesis: Saponins are natural surfactants which can provide highly viscoelastic interfaces. This property can be used to quantify precisely the effect of interfacial dilatational elasticity on the various rheological properties of bulk emulsions. Experiments: We measured the interfacial dilatational elasticity of adsorption layers from four saponins (Quillaja, Escin, Berry, Tea) adsorbed on hexadecane-water and sunflower oil-water interfaces. In parallel, the rheological properties under steady and oscillatory shear deformations were measured for bulk emulsions stabilized by the same saponins (oil volume fraction between 75 and 85 %). Findings: Quillaja saponin and Berry saponin formed solid adsorption layers (shells) on the SFO-water interface. As a consequence, the respective emulsions contained non-spherical drops. For the other systems, the interfacial elasticities varied between 2 mN/m and 500 mN/m. We found that this interfacial elasticity has a very significant impact on the emulsion shear elasticity, a moderate effect on the dynamic yield stress, and no effect on the viscous stress of the respective steadily sheared emulsions. The last conclusion is not trivial, because the di...
Seizure Prediction Using Bidirectional LSTM.
EN: Approximately 50 million people in the world are affected by epilepsy. For patients, anti-epileptic drugs are not always effective, and these drugs may have undesired side effects on a patient's health. If a seizure is predicted, patients will have enough time to take preventive measures. The purpose of this work is to investigate the application of bidirectional LSTM to seizure prediction. In this paper, we trained a model consisting of two bidirectional LSTM layers followed by a fully connected layer on EEG data from canines. The data was provided in the form of a Kaggle competition by the American Epilepsy Society. The main task was to classify interictal and preictal EEG clips. Using this model, we obtained an AUC of 0.84 on the test dataset, which shows that our classifier's performance is above chance level on unseen data. The comparison with previous work shows that bidirectional LSTM networks can achieve significantly better results than SVM and GRU networks.
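To make the reported metric concrete, the following is a minimal sketch of how an AUC can be computed from classifier scores using the rank (Mann-Whitney) formulation, in which 0.5 corresponds to chance level. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive (preictal) clip
    scores higher than a random negative (interictal) clip; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration): 1 = preictal, 0 = interictal.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of clips, which is fine for a sketch; a sort-based rank computation would be preferable for large test sets.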
Decision Support System for Detection and Classification of Skin Cancer using CNN.
EN: Skin cancer is one of the deadliest of all cancers. It is likely to spread to other parts of the body if it is not diagnosed and treated at an early stage. It is mostly caused by the abnormal growth of skin cells and often develops when the body is exposed to sunlight. Furthermore, the detection and characterization of skin cancer at an early stage is a costly and challenging procedure. It is classified according to where it develops and its cell type. High precision and recall are required for the classification of lesions. The paper uses the MNIST HAM-10000 dataset, which contains dermoscopy images. The objective is to propose a system that detects skin cancer and classifies it into different classes using a Convolutional Neural Network. The diagnostic methodology uses image processing and a deep learning model. The dermoscopy images of skin cancer undergo various techniques to remove noise and handle picture resolution. The image count is also increased using various image augmentation techniques. In the end, transfer learning is used to further increase the classification accuracy of the images. Our CNN model gave a weighted aver...
Superposition of droplet elasticity and volume fraction effects on emulsion dynamics.
EN: The rheological properties of emulsions are of considerable importance in a diverse range of scenarios. Here we describe a superposition of the effects of droplet elasticity and volume fraction on the dynamics of emulsions. The superposition is governed by physical interactions between droplets, and provides a new mechanism for modifying the flow behavior of emulsions, by controlling the elasticity of the dispersed phase. We investigate the properties of suspensions of emulsified wormlike micelles (WLM). Dense suspensions of the emulsified WLM droplets exhibit thermally responsive properties in which the viscoelastic moduli decrease by an order of magnitude over a temperature range of 0 °C to 25 °C. Surprisingly, the fragility (i.e. the volume-fraction dependence of the modulus) of the emulsions does not change with temperature. Instead, the emulsion modulus scales as a power-law with volume fraction with a constant exponent across all temperatures even as the droplet properties change from elastic to viscous. Nevertheless, the underlying droplet dynamics depend strongly on temperature. From stress relaxation experiments, we quantify droplet dynamics across the cage b...
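As a rough illustration of what a power-law scaling of modulus with volume fraction means in practice, the sketch below estimates the exponent by ordinary least squares in log-log space. The function name `fit_power_law`, the assumed form G = A * phi^n, and the synthetic data are all hypothetical; the abstract does not state the exact functional form or fitting procedure used.

```python
import math


def fit_power_law(x, y):
    """Fit y = A * x**n by linear regression of log(y) on log(x).
    Returns (A, n). All x and y values must be positive."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n_pts = len(lx)
    mean_x = sum(lx) / n_pts
    mean_y = sum(ly) / n_pts
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    exponent = sxy / sxx
    prefactor = math.exp(mean_y - exponent * mean_x)
    return prefactor, exponent


# Synthetic data (assumed): modulus generated from G = 100 * phi**3.
phi = [0.5, 0.6, 0.7, 0.8]
modulus = [100.0 * p ** 3 for p in phi]
prefactor, exponent = fit_power_law(phi, modulus)
print(f"G ~ {prefactor:.1f} * phi^{exponent:.2f}")
```

Repeating such a fit at each temperature and comparing the exponents is one simple way to check whether the exponent stays constant, as the abstract reports.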
DeepAtom: A Framework for Protein-Ligand Binding Affinity Prediction.
EN: The cornerstone of computational drug design is the calculation of binding affinity between two biological counterparts, especially a chemical compound, i.e., a ligand, and a protein. Predicting the strength of protein-ligand binding with reasonable accuracy is critical for drug discovery. In this paper, we propose a data-driven framework named DeepAtom to accurately predict the protein-ligand binding affinity. With 3D Convolutional Neural Network (3D-CNN) architecture, DeepAtom could automatically extract binding related atomic interaction patterns from the voxelized complex structure. Compared with the other CNN based approaches, our light-weight model design effectively improves the model representational capacity, even with the limited available training data. With validation experiments on the PDBbind v.2016 benchmark and the independent Astex Diverse Set, we demonstrate that the less feature engineering dependent DeepAtom approach consistently outperforms the other state-of-the-art scoring methods. We also compile and propose a new benchmark dataset to further improve the model performances. With the new dataset as training input, DeepAtom achieves Pearson's R=0.83 and RMSE=1...
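The two metrics quoted for binding affinity prediction, Pearson's R and RMSE, can be sketched as follows. This is a plain-Python illustration under assumed toy values; the function names and the example affinities are hypothetical and do not come from the paper or its benchmark.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Toy affinities (assumed): predictions are offset by a constant 1.0.
measured = [4.0, 6.0, 8.0]
predicted = [5.0, 7.0, 9.0]
r = pearson_r(measured, predicted)
err = rmse(measured, predicted)
print(f"R = {r:.2f}, RMSE = {err:.2f}")
```

The toy case shows why both numbers are reported together: a constant offset leaves the correlation perfect while the RMSE still registers the error.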
Depletion attraction favors the elastic response of emulsions flowing in a constriction.
EN: We study the elasto-plastic behavior of dense attractive emulsions under mechanical perturbation. The attraction is introduced through non-specific depletion interactions between the droplets and is controlled by changing the concentration of surfactant micelles in the continuous phase. We find that such attractive forces are not sufficient to induce any measurable modification on the scalings between the local packing fraction and the deformation of the droplets. However, when the emulsions are made to flow through 2D microfluidic constrictions, we uncover a measurable effect of attraction on their elasto-plastic response. Indeed, we measure higher levels of deformation inside the constriction for attractive droplets. In addition, we show that these measurements correlate with droplet rearrangements that are spatially delayed in the constriction for higher attraction forces.
Cross-modal representation alignment of molecular structure and perturbation-induced transcriptional profiles.
EN: Modeling the relationship between chemical structure and molecular activity is a key goal in drug development. Many benchmark tasks have been proposed for molecular property prediction, but these tasks are generally aimed at specific, isolated biomedical properties. In this work, we propose a new cross-modal small molecule retrieval task, designed to force a model to learn to associate the structure of a small molecule with the transcriptional change it induces. We develop this task formally as a multi-view alignment problem, and present a coordinated deep learning approach that jointly optimizes representations of both chemical structure and perturbational gene expression profiles. We benchmark our results against oracle models and principled baselines, and find that cell line variability markedly influences performance in this domain. Our work establishes the feasibility of this new task, elucidates the limitations of current data and systems, and may serve to catalyze future research in small molecule representation learning.
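The retrieval step of such a cross-modal task can be pictured with a small sketch: given an embedding of an expression profile and embeddings of candidate molecules in a shared space, rank the candidates by similarity. Cosine similarity is used here only as an assumed scoring choice, and the names `cosine`, `rank_molecules`, `mol_a`, `mol_b` and `mol_c` are hypothetical; the abstract does not specify the scoring function or the encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_molecules(query, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    return sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)


# Toy embeddings (assumed): the query stands in for an encoded expression
# profile, the candidates for encoded chemical structures.
query = [1.0, 0.0]
candidates = {
    "mol_a": [0.9, 0.1],
    "mol_b": [0.0, 1.0],
    "mol_c": [-1.0, 0.0],
}
ranking = rank_molecules(query, candidates)
print(ranking)
```

In a learned system the two encoders would be trained so that matching structure and profile pairs end up close together; the sketch only covers ranking once embeddings exist.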
LATTE: Latent Type Modeling for Biomedical Entity Linking.
EN: Entity linking is the task of linking mentions of named entities in natural language text to entities in a curated knowledge base. This is of significant importance in the biomedical domain, where it could be used to semantically annotate a large volume of clinical records and biomedical literature with standardized concepts described in an ontology such as the Unified Medical Language System (UMLS). We observe that with precise type information, entity disambiguation becomes a straightforward task. However, fine-grained type information is usually not available in the biomedical domain. Thus, we propose LATTE, a LATent Type Entity Linking model, that improves entity linking by modeling the latent fine-grained type information about mentions and entities. Unlike previous methods that perform entity linking directly between the mentions and the entities, LATTE jointly performs entity disambiguation and latent fine-grained type learning, without direct supervision. We evaluate our model on two biomedical datasets: MedMentions, a large scale public dataset annotated with UMLS concepts, and a de-identified corpus of dictated doctor's notes that has been annotated with ICD concepts. Extensive expe...
Additive Bayesian Network Modelling with the R Package abn.
EN: The R package abn is designed to fit additive Bayesian models to observational datasets. It contains routines to score Bayesian networks based on Bayesian or information theoretic formulations of generalized linear models. It is equipped with exact search and greedy search algorithms to select the best network. It supports a possible blend of continuous, discrete and count data and input of prior knowledge at a structural level. The Bayesian implementation supports random effects to control for one-layer clustering. In this paper, we give an overview of the methodology and illustrate the package's functionalities using a veterinary dataset about respiratory diseases in commercial swine production.
DermGAN: Synthetic Generation of Clinical Skin Images with Pathology.
EN: Despite the recent success in applying supervised deep learning to medical imaging tasks, the problem of obtaining large and diverse expert-annotated datasets required for the development of high-performing models remains particularly challenging. In this work, we explore the possibility of using Generative Adversarial Networks (GAN) to synthesize clinical images with skin conditions. We propose DermGAN, an adaptation of the popular Pix2Pix architecture, to create synthetic images for a pre-specified skin condition while being able to vary its size, location and the underlying skin color. We demonstrate that the generated images are of high fidelity using objective GAN evaluation metrics. In a Human Turing test, we note that the synthetic images are not only visually similar to real images, but also embody the respective skin condition in dermatologists' eyes. Finally, when using the synthetic images as a data augmentation technique for training a skin condition classifier, we observe that the model performs comparably to the baseline model overall while improving on rare but malignant conditions.
Computer-Aided Clinical Skin Disease Diagnosis Using CNN and Object Detection Models.
EN: Skin disease is one of the most common types of human diseases, which may happen to everyone regardless of age, gender or race. Due to the high visual diversity, human diagnosis relies heavily on personal experience, and there is a serious shortage of experienced dermatologists in many countries. To alleviate this problem, computer-aided diagnosis with state-of-the-art (SOTA) machine learning techniques would be a promising solution. In this paper, we aim to understand the performance of convolutional neural network (CNN) based approaches. We first build two versions of skin disease datasets from Internet images: (a) Skin-10, which contains 10 common classes of skin disease with a total of 10,218 images; (b) Skin-100, which is a larger dataset that consists of 19,807 images of 100 skin disease classes. Based on these datasets, we benchmark several SOTA CNN models and show that the accuracy on Skin-100 is much lower than the accuracy on Skin-10. We then implement an ensemble method based on several CNN models and achieve the best accuracy of 79.01% for Skin-10 and 53.54% for Skin-100. We also present an object detection based approach by introducing bounding boxes into the Skin-...
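The abstract does not say how the CNN outputs are combined in the ensemble. One common option, shown here purely as an assumed example, is soft voting: average the per-class probabilities from each model and pick the class with the highest mean. The function name `soft_vote` and the toy probabilities are hypothetical.

```python
def soft_vote(prob_lists):
    """Average class probabilities across models.
    prob_lists: one list of per-class probabilities per model.
    Returns (index of winning class, averaged probabilities)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg


# Toy outputs (assumed) of three models over three classes for one image.
model_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
predicted_class, mean_probs = soft_vote(model_probs)
print(predicted_class, [round(p, 3) for p in mean_probs])
```

In this toy case the first model alone would choose class 0, while the averaged prediction is class 1, which is the kind of disagreement an ensemble is meant to smooth out.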
Biological Value of Centaurea damascena: Minireview.
EN: The family Asteraceae includes a large number of Centaurea species which have been applied in folk medicine. One member of the family Asteraceae is Centaurea damascena, which has been tested for its antibacterial activity. The aim of the study was to discuss the antibacterial activities of the essential oil composition and the methanolic extract of the aerial parts (leaves) of the same plant. Thirty-seven components were characterized with 86 of oxygenated terpenes. The composition in percentage was dominated by 11.45 Fokienol, 8.8 thymol, 8.21 Alpha Terpineol, 7.24 Chrysanthemumic acid, 7.13 Terpinen4-ol and 6.59 Borneol, with a high degree of polymorphism in the occurrence of these compounds as compared with other Centaurea species. The free radical scavenging capacity of the C. damascena methanol extract was calculated by the DPPH and FRAP tests. DPPH radicals were scavenged with an IC50 value of 17.08 microgram per ml. The antioxidant capacity obtained by FRAP was 51.9, expressed in mg Trolox gram per Liter dry weight. The total phenolic compounds of the methanol extracts of aerial parts, as estimated by the Folin Ciocalteu reagent method, was about 460 milligram GAE per gram. The pheno...
A Discreet Wearable IoT Sensor for Continuous Transdermal Alcohol Monitoring -- Challenges and Opportunities.
EN: Non-invasive continuous alcohol monitoring has potential applications in both population research and in clinical management of acute alcohol intoxication or chronic alcoholism. Current wearable monitors based on transdermal alcohol content (TAC) sensing are relatively bulky and have limited quantification accuracy. Here we describe the development of a discreet wearable TAC sensor in the form of a wristband or armband. This novel sensor can detect vapor-phase alcohol in perspiration from 0.09 ppm (equivalent to 0.09 mg/dL sweat alcohol concentration at 25 °C under Henry's Law equilibrium) to over 500 ppm at one-minute time resolution. The TAC sensor is powered by a 110 mAh lithium battery that lasts for over 7 days. In addition, the sensor can function as a medical "internet-of-things" (IoT) device by connecting to an Android smartphone gateway via Bluetooth Low Energy (BLE) and uploading data to a cloud informatics system. Such wearable IoT sensors may enable large-scale alcohol-related research and personalized management. We also present evidence suggesting a hypothesis that perspiration rate is the dominant factor leading to TAC measurement variabilities, wh...
Molecular polaritons for controlling chemistry with quantum optics.
EN: This is a tutorial-style introduction to the field of molecular polaritons. We describe the basic physical principles and consequences of strong light-matter coupling common to molecular ensembles embedded in UV-visible or infrared cavities. Using a microscopic quantum electrodynamics formulation, we discuss the competition between the collective cooperative dipolar response of a molecular ensemble and local dynamical processes that molecules typically undergo, including chemical reactions. We highlight some of the observable consequences of this competition between local and collective effects in linear transmission spectroscopy, including the formal equivalence between quantum mechanical theory and the classical transfer matrix method, under specific conditions of molecular density and indistinguishability. We also overview recent experimental and theoretical developments on strong and ultrastrong coupling with electronic and vibrational transitions, with a special focus on cavity-modified chemistry and infrared spectroscopy under vibrational strong coupling. We finally suggest several opportunities for further studies that may lead to novel applications in chemical and electroma...
SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery.
EN: In drug-discovery-related tasks such as virtual screening, machine learning is emerging as a promising way to predict molecular properties. Conventionally, molecular fingerprints (numerical representations of molecules) are calculated through rule-based algorithms that map molecules to a sparse discrete space. However, these algorithms perform poorly for shallow prediction models or small datasets. To address this issue, we present SMILES Transformer. Inspired by Transformer and pre-trained language models from natural language processing, SMILES Transformer learns molecular fingerprints through unsupervised pre-training of the sequence-to-sequence language model using a huge corpus of SMILES, a text representation system for molecules. We performed benchmarks on 10 datasets against existing fingerprints and graph-based methods and demonstrated the superiority of the proposed algorithms in small-data settings where pre-training facilitated good generalization. Moreover, we define a novel metric to concurrently measure model accuracy and data efficiency.
Generation of swine movement network and analysis of efficient mitigation strategies for African swine fever virus.
EN: Animal movement networks are essential in understanding and containing the spread of infectious diseases in farming industries. Due to its confidential nature, movement data for the US swine farming population is not readily available. Hence, we propose a method to generate such networks from limited data available in the public domain. As a potentially devastating candidate, we simulate the spread of African swine fever virus (ASFV) in our generated network and analyze how the network structure affects the disease spread. We find that high in-degree farm operations (i.e., markets) play critical roles in the disease spread. We also find that targeted isolation and hypothetical vaccination based on high in-degree are more effective for disease control than other centrality-based mitigation strategies. The generated networks can be made more robust by validating them against additional movement data as it becomes available.
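The idea of in-degree based targeted isolation can be illustrated with a minimal Python sketch. The toy edge list, the node names, and the helper functions `in_degree_ranking`, `isolate` and `reachable` are all hypothetical; the abstract does not describe the authors' network generator or epidemic model, so plain graph reachability stands in here for disease spread.

```python
from collections import defaultdict

def in_degree_ranking(edges):
    """Return nodes of a directed edge list sorted by in-degree, highest first."""
    in_deg = defaultdict(int)
    for src, dst in edges:
        in_deg[dst] += 1
        in_deg[src] += 0  # register source-only nodes with in-degree 0
    return sorted(in_deg, key=lambda n: (-in_deg[n], n))

def isolate(edges, nodes_to_isolate):
    """Drop every movement into or out of the isolated nodes."""
    blocked = set(nodes_to_isolate)
    return [(s, d) for s, d in edges if s not in blocked and d not in blocked]

def reachable(edges, seed):
    """Nodes that can be reached from the seed along directed movements."""
    adj = defaultdict(list)
    for s, d in edges:
        adj[s].append(d)
    seen, stack = {seed}, [seed]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Hypothetical toy network: three farms ship to one market, which ships onward.
edges = [
    ("farm_a", "market"), ("farm_b", "market"), ("farm_c", "market"),
    ("market", "farm_d"), ("market", "farm_e"), ("farm_d", "farm_e"),
]

top = in_degree_ranking(edges)[:1]            # isolate the single highest in-degree node
before = reachable(edges, "farm_a")           # exposure with no intervention
after = reachable(isolate(edges, top), "farm_a")  # exposure after isolation
```

In this toy case the market has the highest in-degree, and isolating it cuts the set of operations reachable from `farm_a` from four to one. A real comparison would replace reachability with a stochastic transmission model and test other centrality measures as well.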
Collective Nucleation Dynamics in Two-dimensional Emulsions with Hexagonal Packing.
EN: We report a new mechanism for nucleation in a monolayer of hexagonally packed monodisperse droplet arrays. Upon cooling, we observe solidified droplets to nucleate their supercooled neighbors giving rise to an autocatalytic-like mechanism for accelerated crystallization. This collective mode of nucleation depends on the strength and nature of droplet contacts. Intriguingly, the statistical distribution of the solidified droplet clusters is found to be independent of emulsion characteristics except surfactant. In contrast to classical nucleation theory, our work highlights the need to consider collective effects of nucleation in supercooled concentrated emulsions where droplet crowding is inevitable.
Novel non-equilibrium steady states in multiple emulsions.
EN: We numerically investigate the rheological response of a non-coalescing multiple emulsion under a symmetric shear flow. We find that the dynamics significantly depends on the magnitude of the shear rate and on the number of the encapsulated droplets, two key parameters whose control is fundamental to accurately select the resulting non-equilibrium steady states. The double emulsion, for instance, attains a static steady state in which the external droplet stretches under flow and achieves an elliptical shape (closely resembling the one observed in a sheared isolated fluid droplet), while the internal one remains essentially unaffected. Novel non-equilibrium steady states arise in a multiple emulsion. Under low to moderate shear rates, for instance, the encapsulated droplets display a non-trivial planetary-like motion that considerably affects the shape of the external droplet. Some features of this dynamic behavior are partially captured by the Taylor deformation parameter and the stress tensor. Besides being of theoretical interest in its own right, our results can potentially stimulate further experiments, as most of the predictions could be tested in the lab by monitoring droplet shapes and ...
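For reference, the Taylor deformation parameter mentioned above is conventionally defined from the major and minor axes of the deformed droplet. The symbols $D$, $L$ and $B$ below are the usual textbook notation and are an assumption here, since the abstract does not spell the definition out.

```latex
% Taylor deformation parameter of a droplet with major axis L and minor axis B
D = \frac{L - B}{L + B}, \qquad 0 \le D < 1
% D = 0 for a circular (undeformed) droplet; D grows as the droplet elongates.
```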
Antibacterial and Antioxidant Activities of Centeurea damascena Methanolic Extract.
EN: The family Asteraceae includes a large number of Centaurea species which have been applied in folk medicine. One member of the family Asteraceae is Centaurea damascena, which has been tested for its antibacterial and antioxidant activity as well as its toxicity. The aims of the study were to determine the antimicrobial and antioxidant activities and toxicity of methanolic plant extracts of Centaurea damascena. The methanolic extracts were screened for their antibacterial activity against nine bacteria (Staphylococcus aureus ATCC 43300, Bacillus subtilis ATCC 6633, Micrococcus luteus ATCC 10240, Staphylococcus epidermidis ATCC 12228, Escherichia coli ATCC 11293, Pseudomonas aeruginosa, Klebsiella pneumoniae, Enterobacter aerogenes ATCC 13048 and Salmonella typhi ATCC 19430). The antibacterial activity was assessed using the disc diffusion method, and the minimum inhibitory concentrations (MIC) were determined using the microdilution method. The extracts from Centaurea damascena possessed antibacterial activity against several of the tested microorganisms. The MIC of the methanol extract of C. damascena ranged from 60 to 1100 microgram per mL. Free radical scavenging capacity of the C. dam...
A Study of Data Pre-processing Techniques for Imbalanced Biomedical Data Classification.
EN: Biomedical data are widely used in developing prediction models for identifying a specific tumor, drug discovery and classification of human cancers. However, previous studies usually focused on different classifiers and overlooked the class imbalance problem in real-world biomedical datasets. There is a lack of studies evaluating data pre-processing techniques, such as resampling and feature selection, for learning from imbalanced biomedical data. The relationship between data pre-processing techniques and the data distributions has never been analysed in previous studies. This article mainly focuses on reviewing and evaluating some popular and recently developed resampling and feature selection methods for class imbalance learning. We analyse the effectiveness of each technique from a data distribution perspective. Extensive experiments have been carried out with five classifiers, four performance measures and eight learning techniques across twenty real-world datasets. Experimental results show that: (1) resampling and feature selection techniques exhibit better performance using a support vector machine (SVM) classifier. However, resampling and feature selection techniques perform po...
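As a concrete picture of what resampling means for class imbalance, here is a minimal sketch of random oversampling in plain Python. It is a generic illustration only: the abstract does not list which resampling methods were evaluated, and the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))  # resample with replacement
            out_y.append(cls)
    return out_x, out_y

# Hypothetical toy data: four majority samples, one minority sample.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
class_counts = Counter(y_bal)
```

Resampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data.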
Estimating Skin Tone and Effects on Classification Performance in Dermatology Datasets.
EN: Recent advances in computer vision and deep learning have led to breakthroughs in the development of automated skin image analysis. In particular, skin cancer classification models have achieved performance higher than trained expert dermatologists. However, no attempt has been made to evaluate the consistency in performance of machine learning models across populations with varying skin tones. In this paper, we present an approach to estimate skin tone in benchmark skin disease datasets, and investigate whether model performance is dependent on this measure. Specifically, we use individual typology angle (ITA) to approximate skin tone in dermatology datasets. We look at the distribution of ITA values to better understand skin color representation in two benchmark datasets: 1) the ISIC 2018 Challenge dataset, a collection of dermoscopic images of skin lesions for the detection of skin cancer, and 2) the SD-198 dataset, a collection of clinical images capturing a wide variety of skin diseases. To estimate ITA, we first develop segmentation models to isolate non-diseased areas of skin. We find that the majority of the data in the two datasets have ITA values between 34.5° and 48°...
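The individual typology angle is commonly computed from the CIELAB lightness L* and yellow-blue component b* as ITA = arctan((L* - 50) / b*) × 180/π. The sketch below applies that standard formula to a single pair of values; the function name is hypothetical, and how the authors aggregate pixel values over the segmented non-diseased skin is not stated in the abstract.

```python
import math

def individual_typology_angle(l_star, b_star):
    """ITA in degrees from CIELAB L* and b* (standard definition).
    Assumes b_star is non-zero, as is typical for skin pixels."""
    if b_star == 0:
        raise ValueError("b* must be non-zero to compute ITA")
    return math.degrees(math.atan((l_star - 50.0) / b_star))

# Hypothetical example values, e.g. mean L* and b* over a healthy-skin mask.
ita_example = individual_typology_angle(60.0, 10.0)
ita_neutral = individual_typology_angle(50.0, 20.0)
```

Higher ITA values correspond to lighter skin, so the resulting angles can be binned into tone categories before comparing classifier performance across bins.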
Using Arabic Tweets to Understand Drug Selling Behaviors.
EN: Twitter is a popular platform for e-commerce in the Arab region including the sale of illegal goods and services. Social media platforms present multiple opportunities to mine information about behaviors pertaining to both illicit and pharmaceutical drugs and likewise to legal prescription drugs sold without a prescription, i.e., illegally. Recognized as a public health risk, the sale and use of illegal drugs, counterfeit versions of legal drugs, and legal drugs sold without a prescription constitute a widespread problem that is reflected in and facilitated by social media. Twitter provides a crucial resource for monitoring legal and illegal drug sales in order to support the larger goal of finding ways to protect patient safety. We collected our dataset using Arabic keywords. We then categorized the data using four machine learning classifiers. Based on a comparison of the respective results, we assessed the accuracy of each classifier in predicting two important considerations in analysing the extent to which drugs are available on social media: references to drugs for sale and the legality/illegality of the drugs thus advertised. For predicting tweets selling drugs, Support Vect...
Quantum non-demolition state detection and spectroscopy of single trapped molecules.
EN: Trapped atoms and ions are among the best controlled quantum systems which find widespread applications in quantum information, sensing and metrology. For molecules, however, a similar degree of control is currently lacking owing to their complex energy-level structure. Quantum-logic protocols in which atomic ions serve as probes for molecular ions are a promising route for achieving this level of control, especially with homonuclear molecules that decouple from black-body radiation. Here, a quantum-non-demolition protocol on single trapped N$_2^+$ molecules is demonstrated. The spin-rovibronic state of the molecule is detected with more than 99% fidelity and the position and strength of a spectroscopic transition in the molecule are determined, both without destroying the molecular quantum state. The present method lays the foundations for new approaches to molecular precision spectroscopy, for state-to-state chemistry on the single-molecule level and for the implementation of molecular qubits.
Transformer-CNN: Fast and Reliable tool for QSAR.
EN: We present SMILES-embeddings derived from the internal encoder state of a Transformer [1] model trained to canonize SMILES as a Seq2Seq problem. Using a CharNN [2] architecture upon the embeddings results in higher quality interpretable QSAR/QSPR models on diverse benchmark datasets including regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, and thus the prognosis is based on an internal consensus. That both the augmentation and transfer learning are based on embeddings allows the method to provide good results for small datasets. We discuss the reasons for such effectiveness and draft future directions for the development of the method. The source code and the embeddings needed to train a QSAR model are available on https://github.com/bigchem/transformer-cnn. The repository also has a standalone program for QSAR prognosis which calculates individual atoms contributions, thus interpreting the model's result. OCHEM [3] environment (https://ochem.eu) hosts the on-line implementation of the method proposed.
A statistical mechanical model for drug release: relations between release parameters and porosity.
EN: A lattice gas model is proposed for investigating the release of drug molecules from devices with semi-permeable, porous membranes in two and three dimensions. The kinetics of this model were obtained through the analytical solution of the three-dimensional diffusion equation for systems without a membrane and through Monte Carlo simulations. Pharmaceutical data from drug release are usually fitted to the Weibull function, $\exp[-(t/\tau)^b]$, also known as the stretched exponential, and the dependence of the fitted parameters $b$ and $\tau$ is usually associated, in the pharmaceutical literature, with the physical mechanisms dominating the drug dynamics inside the capsule. The parameters $\tau$ and $b$ are found to depend on the porosity $\lambda$ through a simple linear relation between $\tau$ and $\lambda^{-1}$, which can be explained through simple physically based arguments, and a scaling relation between $b$ and $\lambda$, with the scaling coefficient proportional to the system dimension.
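To show how the two Weibull parameters can be extracted from release data, the sketch below fits the stretched exponential by linearisation: taking logarithms twice gives ln(-ln S) = b ln t - b ln τ, a straight line in ln t. It assumes S(t) = exp[-(t/τ)^b] is the fraction of drug still inside the device, which the abstract does not state explicitly; the function name and the synthetic data are hypothetical, and this is not the authors' fitting procedure.

```python
import math

def fit_weibull(times, remaining):
    """Least-squares fit of S(t) = exp(-(t/tau)**b) via the linearised form
    ln(-ln S) = b*ln(t) - b*ln(tau). Requires t > 0 and 0 < S < 1."""
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(s)) for s in remaining]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope of the line is the exponent b
    intercept = mean_y - b * mean_x    # intercept equals -b*ln(tau)
    tau = math.exp(-intercept / b)
    return tau, b

# Synthetic, noise-free data generated from known parameters.
true_tau, true_b = 2.0, 0.7
times = [0.5, 1.0, 2.0, 4.0, 8.0]
remaining = [math.exp(-((t / true_tau) ** true_b)) for t in times]
tau_hat, b_hat = fit_weibull(times, remaining)
```

With noisy experimental data a nonlinear least-squares fit on the original curve is usually preferable, because the double logarithm amplifies errors near S = 1 and S = 0.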
A hypothesis testing framework for the ratio of means of two negative binomial distributions: classifying the efficacy of anthelmintic treatment against intestinal parasites.
EN: Over-dispersed count data typically pose a challenge to analysis using standard statistical methods, particularly when evaluating the efficacy of an intervention through the observed effect on the mean. We outline a novel statistical method for analysing such data, along with a statistically coherent framework within which the observed efficacy is assigned one of four easily interpretable classifications relative to a target efficacy: "adequate", "reduced", "borderline" or "inconclusive". We illustrate our approach by analysing the anthelmintic efficacy of mebendazole using a dataset of egg reduction rates relating to three intestinal parasites from a treatment arm of a randomised controlled trial involving 91 children on Pemba Island, Tanzania. Numerical validation of the type I error rates of the novel method indicates that it performs as well as the best existing computationally simple method, but with the additional advantage of providing valid inference in the case of an observed efficacy of 100%. The framework and statistical analysis method presented also allow the required sample size of a prospective study to be determined via simulation. Both the framework and method prese...
From Species to Cultivar: Soybean Cultivar Recognition using Multiscale Sliding Chord Matching of Leaf Images.
EN: Leaf image recognition techniques have been actively researched for plant species identification. However, it remains unclear whether leaf patterns can provide sufficient information for cultivar recognition. This paper reports the first attempt at soybean cultivar recognition from plant leaves, which is not only a challenging research problem but also important for soybean cultivar evaluation, selection and production in agriculture. In this paper, we propose a novel multiscale sliding chord matching (MSCM) approach to extract leaf patterns that are distinctive for soybean cultivar identification. A chord is defined to slide along the contour for measuring the synchronised patterns of exterior shape and interior appearance of soybean leaf images. A multiscale sliding chord strategy is developed to extract features in a coarse-to-fine hierarchical order. A joint description that integrates the leaf descriptors from different parts of a soybean plant is proposed for further enhancing the discriminative power of cultivar description. We built a cultivar leaf image database, SoyCultivar, consisting of 1200 sample leaf images from 200 soybean cultivars for performance evaluation. Encoura...
Combining docking pose rank and structure with deep learning improves protein-ligand binding mode prediction.
EN: We present a simple, modular graph-based convolutional neural network that takes structural information from protein-ligand complexes as input to generate models for activity and binding mode prediction. Complex structures are generated by a standard docking procedure and fed into a dual-graph architecture that includes separate sub-networks for the ligand bonded topology and the ligand-protein contact map. This network division allows contributions from ligand identity to be distinguished from effects of protein-ligand interactions on classification. We show, in agreement with recent literature, that dataset bias drives many of the promising results on virtual screening that have previously been reported. However, we also show that our neural network is capable of learning from protein structural information when, as in the case of binding mode prediction, an unbiased dataset is constructed. We develop a deep learning model for binding mode prediction that uses docking ranking as input in combination with docking structures. This strategy mirrors past consensus models and outperforms the baseline docking program in a variety of tests, including on cross-docking datasets that mimic...
Surface phase transitions in foams and emulsions.
EN: Surface phase transitions in surfactant adsorption layers are known to affect the dynamic properties of foams and to induce surface nucleation in freezing emulsion drops. Recently, these transitions were found to play a role in several other phenomena, opening new opportunities for controlling foam and emulsion properties. This review presents a brief outlook of the emerging opportunities in this area. Three topics are emphasized: (1) The use of surfactant mixtures for inducing phase transitions on bubble surfaces in foams; (2) The peculiar properties of natural surfactants saponins which form extremely viscoelastic surface layers; and (3) The main phenomena in emulsions, for which the surface phase transitions are important. The overall conclusion from the reviewed literature is that surface phase transitions could be used as a powerful tool to control many foam and emulsion properties, but we need deeper understanding of the underlying phenomena to explore fully these opportunities.
Discrete fluidization of dense monodisperse emulsions in neutral wetting microchannels.
EN: The rheology of pressure-driven flows of two-dimensional dense monodisperse emulsions in neutral wetting microchannels is investigated by means of mesoscopic lattice simulations, capable of handling large collections of droplets, in the order of several hundreds. The simulations reveal that the fluidization of the emulsion proceeds through a sequence of discrete steps, characterized by yielding events whereby layers of droplets start rolling over each other, thus leading to sudden drops of the relative effective viscosity. It is shown that such discrete fluidization is robust against loss of confinement, namely it persists also in the regime of small ratios of the droplet diameter over the microchannel width. We also develop a simple phenomenological model which predicts a linear relation between the relative effective viscosity of the emulsion and the product of the confinement parameter (global size of the device over droplet radius) and the viscosity ratio between the disperse and continuous phases. The model shows excellent agreement with the numerical simulations. The present work offers new insights to enable the design of microfluidic scaffolds for tissue engineering applica...
A machine learning method correlating pulse pressure wave data with pregnancy.
EN: Pulse feeling, the tactile arterial palpation of the heartbeat, has been widely used in traditional Chinese medicine (TCM) to diagnose various diseases. The quantitative relationship between the pulse wave and health conditions, however, has not been investigated in modern medicine. In this paper, we explored the correlation between the pulse pressure wave (PPW), rather than the key pulse features used in TCM, and pregnancy by using deep learning technology. This computational approach shows that the accuracy of pregnancy detection by the PPW is 84% with an AUC of 91%. Our study is a proof of concept of pulse diagnosis and will also motivate further sophisticated investigations on pulse waves.
State-selective coherent motional excitation as a new approach for the manipulation, spectroscopy and state-to-state chemistry of single molecular ions.
EN: We present theoretical and experimental progress towards a new approach for the precision spectroscopy, coherent manipulation and state-to-state chemistry of single isolated molecular ions in the gas phase. Our method consists of a molecular beam for creating packets of rotationally cold neutrals from which a single molecule is state-selectively ionized and trapped inside a radiofrequency ion trap. In addition to the molecular ion, a single co-trapped atomic ion is used to cool the molecular external degrees of freedom to the ground state of the trap and to detect the molecular state using state-selective coherent motional excitation from a modulated optical-dipole force acting on the molecule. We present a detailed discussion and theoretical characterization of the present approach. We simulate the molecular signal experimentally using a single atomic ion indicating that different rovibronic molecular states can be resolved and individually detected with our method. The present approach for the coherent control and non-destructive detection of the quantum state of a single molecular ion opens up new perspectives for precision spectroscopies relevant for, e.g., tests of fundamental...
Molecular Weight Dependent Structure and Polymer Density of the Exopolysaccharide Levan.
EN: Levan is a bacterial homopolysaccharide, which consists of beta-2,6 linked beta-D-fructose monomers. Because of its structural properties and its health promoting effects, levan is a promising functional ingredient for the food, cosmetic and pharma industry. The properties of levan have been reported to be linked to its molecular weight. For a better understanding of how its molecular weight determines its polymer conformation in aqueous solution, levan produced by the food grade acetic acid bacterium Gluconobacter albidus TMW 2.1191 was analysed over a broad molecular weight range using dynamic and static light scattering and viscometry. Levan, with low molecular weight, exhibited a compact random coil structure. As the molecular weight increased, the structure transformed into a compact non-drained sphere. The density of the sphere continued to increase with increasing molecular weight. This resulted in a negative exponent in the Mark-Houwink-Sakurada Plot. For the first time, an increase in molecular density with increasing molecular weight, as determined by a negative Mark-Houwink-Sakurada exponent, could be shown for biopolymers. Our results reveal the unique properties of hig...
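To see why a negative Mark-Houwink-Sakurada exponent goes together with densification, a short worked relation may help. This is only a sketch: it assumes the Einstein hard-sphere viscosity relation holds for the compact levan particles, and the symbols $K$, $a$, $V_h$ and $\rho_{\mathrm{eff}}$ are introduced here for illustration rather than taken from the paper.

```latex
\begin{align}
  % Mark-Houwink-Sakurada relation: intrinsic viscosity vs. molecular weight
  [\eta] &= K M^{a} \\
  % Einstein relation for impenetrable spheres of hydrodynamic volume V_h (assumption)
  [\eta] &= \frac{2.5\, N_A V_h}{M} \\
  % Effective density of one particle, obtained by combining the two lines above
  \rho_{\mathrm{eff}} &= \frac{M}{N_A V_h} = \frac{2.5}{[\eta]} = \frac{2.5}{K}\, M^{-a}
\end{align}
```

Under this assumption, $a < 0$ gives $\rho_{\mathrm{eff}} \propto M^{|a|}$, so the effective density grows with molecular weight, which matches the trend described in the abstract.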
A Hybrid Deep Learning Approach for Diagnosis of the Erythemato-Squamous Disease.
EN: The diagnosis of the Erythemato-squamous disease (ESD) is accepted as a difficult problem in dermatology. ESD is a form of skin disease. It generally causes redness of the skin and may also cause loss of skin. It is generally due to genetic or environmental factors. ESD comprises six classes of skin conditions, namely pityriasis rubra pilaris, lichen planus, chronic dermatitis, psoriasis, seborrheic dermatitis and pityriasis rosea. The automated diagnosis of ESD can help doctors and dermatologists reduce their effort and make faster treatment decisions. The literature is replete with works that used conventional machine learning methods for the diagnosis of ESD. However, there are few instances of deep learning being applied to the diagnosis of ESD. In this paper, we propose a novel hybrid deep learning approach, Derm2Vec, for the diagnosis of ESD. Derm2Vec is a hybrid deep learning model that consists of both Autoencoders and Deep Neural Networks. We also apply a conventional Deep Neural Network (DNN) for the classification of ESD. We apply both Derm2Vec and DNN along with other traditional machine learning methods on a real world dermatolo...
The impact of patient clinical information on automated skin cancer detection.
EN: Skin cancer is one of the most common types of cancer around the world. For this reason, different approaches have been proposed over the past years to help detect it. Nonetheless, most of them are based only on dermoscopy images and do not take into account the patient clinical information. In this work, first, we present a new dataset that contains clinical images, acquired from smartphones, and patient clinical information of the skin lesions. Next, we introduce a straightforward approach to combine the clinical data and the images using different well-known deep learning models. These models are applied to the presented dataset using only the images and combining them with the patient clinical information. We present a comprehensive study to show the impact of the clinical data on the final predictions. The results obtained by combining both sets of information show a general improvement of around 7% in the balanced accuracy for all models. In addition, the statistical test indicates significant differences between the models that do and do not combine both kinds of data. The improvement achieved shows the potential of using patient clinical information in skin cancer detection and...
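Balanced accuracy, the metric quoted above, is the mean of the per-class recalls, so a rare class weighs as much as a common one. The sketch below illustrates the definition on made-up labels; the function name and the toy data are assumptions for illustration and are not taken from the paper or its dataset.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: 8 benign lesions and 2 malignant ones.
y_true = ["benign"] * 8 + ["malignant"] * 2
# The classifier misses one of the two malignant lesions.
y_pred = ["benign"] * 8 + ["benign", "malignant"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, bal_acc)
```

On this toy set the plain accuracy looks high because the benign class dominates, while the balanced accuracy exposes the missed malignant case.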
A superpixel-driven deep learning approach for the analysis of dermatological wounds.
EN: Background. The image-based identification of distinct tissues within dermatological wounds enhances patients' care since it requires no intrusive evaluations. This manuscript presents an approach, which we named QTDU, that combines deep learning models with superpixel-driven segmentation methods for assessing the quality of tissues from dermatological ulcers. Method. QTDU consists of a three-stage pipeline for ulcer segmentation, tissue labeling, and wounded area quantification. We set up our approach by using a real and annotated set of dermatological ulcers to train several deep learning models to identify ulcered superpixels. Results. Empirical evaluations on 179,572 superpixels divided into four classes showed that QTDU accurately spots wounded tissues (AUC = 0.986, sensitivity = 0.97, and specificity = 0.974) and outperformed machine-learning approaches by up to 8.2% in F1-Score through fine-tuning of a ResNet-based model. Last, but not least, experimental evaluations also showed QTDU correctly quantified wounded tissue areas within a 0.089 Mean Absolute Error ratio. Conclusions. Results indicate QTDU effectiveness for both tissue segmenta...
PubMedQA: A Dataset for Biomedical Research Question Answering.
EN: We introduce PubMedQA, a novel biomedical question answering (QA) dataset collected from PubMed abstracts. The task of PubMedQA is to answer research questions with yes/no/maybe (e.g.: Do preoperative statins reduce atrial fibrillation after coronary artery bypass grafting?) using the corresponding abstracts. PubMedQA has 1k expert-annotated, 61.2k unlabeled and 211.3k artificially generated QA instances. Each PubMedQA instance is composed of (1) a question which is either an existing research article title or derived from one, (2) a context which is the corresponding abstract without its conclusion, (3) a long answer, which is the conclusion of the abstract and, presumably, answers the research question, and (4) a yes/no/maybe answer which summarizes the conclusion. PubMedQA is the first QA dataset where reasoning over biomedical research texts, especially their quantitative contents, is required to answer the questions. Our best performing model, multi-phase fine-tuning of BioBERT with long answer bag-of-word statistics as additional supervision, achieves 68.1% accuracy, compared to single human performance of 78.0% accuracy and majority-baseline of 55.2% accuracy, leaving much r...
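The four-part instance structure and the majority baseline mentioned above can be sketched as follows. The class name, field names and toy labels are hypothetical and chosen only for illustration; they are not the dataset's actual schema, and the toy labels are not real annotations.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QAInstance:
    """Hypothetical container mirroring the four parts listed in the abstract."""
    question: str      # (1) research question, e.g. an article title
    context: str       # (2) abstract without its conclusion
    long_answer: str   # (3) conclusion of the abstract
    decision: str      # (4) "yes", "no" or "maybe"


def majority_baseline(instances):
    """Always predict the most frequent label; return that label and its accuracy."""
    counts = Counter(inst.decision for inst in instances)
    label, hits = counts.most_common(1)[0]
    return label, hits / len(instances)


# Placeholder instances with made-up labels (not real PubMedQA data).
toy_labels = ["yes", "yes", "no", "maybe"]
toy_set = [
    QAInstance(f"toy question {i}", "toy context", "toy conclusion", label)
    for i, label in enumerate(toy_labels)
]

majority_label, baseline_acc = majority_baseline(toy_set)
print(majority_label, baseline_acc)
```

A model is only informative to the extent that it beats this baseline, which is why the abstract reports the majority-baseline accuracy next to the model and human figures.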
Global Locality in Biomedical Relation and Event Extraction.
EN: Due to the exponential growth of biomedical literature, event and relation extraction are important tasks in biomedical text mining. Most work focuses only on relation extraction and detects a single entity pair mention in a short span of text, which is not ideal given the long sentences that appear in biomedical contexts. We propose an approach to both relation and event extraction that simultaneously predicts relationships between all mention pairs in a text. We also perform an empirical study to discuss different network setups for this purpose. The best performing model includes a set of multi-head attentions and convolutions, an adaptation of the transformer architecture, which offers self-attention the ability to strengthen dependencies among related elements, and models the interaction between features extracted by multiple attention heads. Experimental results demonstrate that our approach outperforms the state of the art on a set of benchmark biomedical corpora including BioNLP 2009, 2011, 2013 and BioCreative 2017 shared tasks.
Learning-Based Video Game Development in MLP@UoM: An Overview.
EN: In general, video games not only prevail in entertainment but have also become an alternative methodology for knowledge learning, skill acquisition and assistance for medical treatment as well as health care in education, vocational/military training and medicine. On the other hand, video games also provide an ideal test bed for AI research. To a large extent, however, video game development is still a laborious and costly process, and there are many technical challenges ranging from game generation to intelligent agent creation. Unlike traditional methodologies, in the Machine Learning and Perception Lab at the University of Manchester (MLP@UoM), we advocate applying machine learning to different tasks in video game development to address several challenges systematically. In this paper, we give an overview of the main progress made in MLP@UoM recently and offer an outlook on future research directions in learning-based video game development arising from our work.
On the effect of coalescence on the rheology of emulsions.
EN: We present a numerical study of the rheology of a two-fluid emulsion in dilute and semidilute conditions. The analysis is performed for different capillary numbers, volume fractions and viscosity ratios under the assumption of negligible inertia and zero buoyancy force. The effective viscosity of the system increases for low values of the volume fraction and decreases for higher values, with a maximum at about 20% concentration of the disperse phase. When the dispersed fluid has lower viscosity, the normalised effective viscosity becomes smaller than 1 for high enough volume fractions. To single out the effect of droplet coalescence on the rheology of the emulsion we introduce an Eulerian force which prevents merging, effectively modelling the presence of surfactants in the system. When the coalescence is inhibited the effective viscosity is always greater than 1 and the curvature of the function representing the emulsion effective viscosity vs. the volume fraction becomes positive, resembling the behaviour of suspensions of deformable particles. The reduction of the effective viscosity in the presence of coalescence is associated with the reduction of the total surface of the disper...
Apollonian Packing in Polydisperse Emulsions.
EN: We have discovered the existence of polydisperse High Internal-Phase-Ratio Emulsions (HIPE) in which the internal-phase droplets, present at 95% volume fraction, remain spherical and organize themselves in the available space according to Apollonian packing rules. These polydisperse HIPE are formed during emulsification of surfactant-poor compositions of oil-surfactant-water two-phase systems. Their droplet size-distributions evolve spontaneously towards power laws with the Apollonian exponent. Small-Angle X-Ray Scattering performed on aged HIPEs demonstrated that the droplet packing structure coincided with that of a numerically simulated Random Apollonian Packing. We argue that these peculiar, space-filling assemblies are a result of coalescence and fragmentation processes obeying simple geometrical rules of conserving total volume and minimizing surface area.
Interactive molecular dynamics in virtual reality for accurate flexible protein-ligand docking.
EN: Simulating drug binding and unbinding is a challenge, as the rugged energy landscapes that separate bound and unbound states require extensive sampling that consumes significant computational resources. Here, we describe the use of interactive molecular dynamics in virtual reality (iMD-VR) as an accurate low-cost strategy for flexible protein-ligand docking. We outline an experimental protocol which enables expert iMD-VR users to guide ligands into and out of the binding pockets of trypsin, neuraminidase, and HIV-1 protease, and recreate their respective crystallographic protein-ligand binding poses within 5-10 minutes. Following a brief training phase, our studies showed that iMD-VR novices were able to generate unbinding and rebinding pathways on timescales similar to those of iMD-VR experts, with the majority able to recover binding poses within 2.15 Angstrom RMSD of the crystallographic binding pose. These results indicate that iMD-VR affords sufficient control for users to carry out the detailed atomic manipulations required to dock flexible ligands into dynamic enzyme active sites and recover crystallographic poses, offering an interesting new approach for simulating drug docking and...
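The pose-recovery criterion above compares a docked ligand pose with the crystallographic one by root-mean-square deviation (RMSD). A minimal sketch follows, assuming both poses are already expressed in the same protein-aligned frame with atoms listed in the same order, so no superposition step is performed; the coordinates are invented for illustration, and only the 2.15 Angstrom threshold comes from the abstract.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical three-atom ligand, coordinates in Angstrom.
crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
# Docked pose: the same ligand translated by 2 Angstrom along z.
docked_pose = [(x, y, z + 2.0) for (x, y, z) in crystal_pose]

value = rmsd(crystal_pose, docked_pose)
recovered = value <= 2.15  # threshold quoted in the abstract
print(value, recovered)
```

Because every atom is displaced by the same 2 Angstrom, the RMSD equals that displacement, and the toy pose would count as recovered under the quoted threshold.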
BERT-based Ranking for Biomedical Entity Normalization.
EN: Developing high-performance entity normalization algorithms that can alleviate the term variation problem is of great interest to the biomedical community. Although deep learning-based methods have been successfully applied to biomedical entity normalization, they often depend on traditional context-independent word embeddings. Bidirectional Encoder Representations from Transformers (BERT), BERT for Biomedical Text Mining (BioBERT) and BERT for Clinical Text Mining (ClinicalBERT) were recently introduced to pre-train contextualized word representation models using bidirectional Transformers, advancing the state-of-the-art for many natural language processing tasks. In this study, we proposed an entity normalization architecture by fine-tuning the pre-trained BERT / BioBERT / ClinicalBERT models and conducted extensive experiments to evaluate the effectiveness of the pre-trained models for biomedical entity normalization using three different types of datasets. Our experimental results show that the best fine-tuned models consistently outperformed previous methods and advanced the state-of-the-art for biomedical entity normalization, with up to 1.17% increase in accuracy.
Optimizing vaccine distribution networks in low and middle-income countries.
EN: Vaccination has been proven to be the most effective method to prevent infectious diseases. However, there are still millions of children in low and middle-income countries who are not covered by routine vaccines and remain at risk. The World Health Organization - Expanded Programme on Immunization (WHO-EPI) was designed to provide universal childhood vaccine access for children across the world and in this work, we address the design of the distribution network for WHO-EPI vaccines. In particular, we formulate the network design problem as a mixed integer program (MIP) and present a new algorithm for typical problems that are too large to be solved using commercial MIP software. We test the algorithm using data derived from four different countries in sub-Saharan Africa and show that the algorithm is able to obtain high-quality solutions for even the largest problems within a few minutes.
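For readers unfamiliar with network design as a mixed integer program, a generic fixed-charge formulation is sketched below. It is a textbook-style illustration, not the authors' model: the sets $I$ (candidate distribution hubs) and $J$ (clinics) and the parameters $f_i$, $c_{ij}$, $d_j$ and $u_i$ are all assumptions introduced here.

```latex
\begin{align}
  \min_{x,\,y} \quad & \sum_{i \in I} f_i\, y_i + \sum_{i \in I} \sum_{j \in J} c_{ij}\, x_{ij}
      && \text{(fixed hub cost plus transport cost)} \\
  \text{s.t.} \quad  & \sum_{i \in I} x_{ij} = d_j
      && \forall j \in J \quad \text{(each clinic's vaccine demand is met)} \\
                     & \sum_{j \in J} x_{ij} \le u_i\, y_i
      && \forall i \in I \quad \text{(flow only through open hubs, within capacity)} \\
                     & y_i \in \{0, 1\}, \quad x_{ij} \ge 0
      && \forall i \in I,\ j \in J
\end{align}
```

The binary variables $y_i$ are what make such a problem hard at scale, which is the setting in which a purpose-built algorithm can outperform general MIP software.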
A fluid bilayer phase in aqueous mixtures of fatty alcohol and cationic surfactant.
EN: The $L_α$ phase of lipid bilayers is a fluid self-assembled state, key to the formulation of cosmetics, detergents and pharmaceutics. Despite having been extensively scrutinized in self-assembled phospholipid or surfactant bilayers, the formation of a fluid $L_α$ state has defied understanding in mixtures of fatty alcohols, surfactants and water, where it is viewed as the essential step for the preparation of creamy dispersions. Here, atomistic molecular dynamics simulations show the existence of a fluid bilayer in aqueous mixtures of cetyl (C$_{16}$OH) and stearyl (C$_{18}$OH) alcohols, and cetyl-trimethylammonium chloride (CTAC). These simulated bilayer systems not only display a rich temperature phase diagram with many of the features seen in experiments but also carry the unambiguous signature of fluid bilayer behavior.
Encapsulation of oils and fragrances by core-in-shell structures from silica particles, polymers and surfactants: The brick-and-mortar concept.
EN: Colloidosomes make it possible to encapsulate oily substances in water in the form of core-in-shell structures. In this study, we produced microcapsules with shells made of colloidal particles, where the interparticle openings are blocked by mixed layers of polymer and surfactant that prevent the leakage of cargo molecules. The particles and polymer play the role of bricks and mortar. We used hydrophilic silica particles, which were partially hydrophobized by the adsorption of potassium oleate to enable them to stabilize Pickering emulsions. Various polymers were tested to select the most appropriate one. The encapsulation procedure is simple and includes a single homogenization by ultrasound. The produced capsules are pH responsive. They are stable in an aqueous phase with pH in the range 3-6, but at pH>6 they are destabilized and their cargo is released. With the optimized formulation of silica particles, polymer, oleate and NaCl, we were able to encapsulate various oils and fragrances, such as tetradecane, limonene, benzyl salicylate and citronellol. All of them have a limited and not too high solubility in water. In contrast, no stable microcapsules were obtained with oils that ...
An Atomistic First-Principles Density Functional Theory Model for Single Layer Dry Stratum Corneum.
EN: Many questions concerning the biophysical and physiological properties of skin are still open. Skin aging, permeability, dermal absorption, hydration and transdermal drug delivery are a few examples of processes whose underlying mechanisms remain to be unveiled. In this work we present a first-principles density functional quantum atomistic model for single layer stratum corneum (SC) in order to help unveil the molecular interactions behind the skin properties at this scale. The molecular structure of SC was modeled by an archetype of its hygroscopic proteic portion inside the corneocytes, the natural moisturizing factor (NMF), coupled to glycerol molecules which represent the lipid fraction of SC. The vibrational spectrum was calculated and compared to Fourier-Transform Infrared Absorption spectroscopy (FTIR) experimental data obtained on an animal model of SC. We noticed that bands in the fingerprint region (800-1800 cm$^{-1}$) were correctly assigned. Moreover, our calculations revealed the existence of two coupled vibrations between the hydroxyl group of the lipid and the NMF methylene (1120 and 1160 cm$^{-1}$), which are of special interest since they probe the lipid-amino acid coupling. The ...
Quantified uncertainty of flexible protein-protein docking algorithms.
EN: The strength or weakness of an algorithm is ultimately governed by the confidence of its result. When the domain of the problem is large (e.g. traversal of a high-dimensional space), a perfect solution cannot be obtained, so approximations must be made. These approximations often lead to a reported quantity of interest (QOI) which varies between runs, decreasing the confidence of any single run. When the algorithm further computes this final QOI based on uncertain or noisy data, the variability (or lack of confidence) of the final QOI increases. Unbounded, these two sources of uncertainty (algorithmic approximations and uncertainty in input data) can result in a reported statistic that has low correlation with ground truth. In biological applications, this is especially applicable, as the search space is generally approximated at least to some degree (e.g. a high percentage of protein structures are invalid or energetically unfavorable) and the explicit conversion from continuous to discrete space for protein representation implies some uncertainty in the input data. This research applies uncertainty quantification techniques to the difficult protein-protein docking problem, firs...
Unifying machine learning and quantum chemistry -- a deep neural network for molecular wavefunctions.
EN: Machine learning advances chemistry and materials science by enabling large-scale exploration of chemical space based on quantum chemical calculations. While these models supply fast and accurate predictions of atomistic chemical properties, they do not explicitly capture the electronic degrees of freedom of a molecule, which limits their applicability for reactive chemistry and chemical analysis. Here we present a deep learning framework for the prediction of the quantum mechanical wavefunction in a local basis of atomic orbitals from which all other ground-state properties can be derived. This approach retains full access to the electronic structure via the wavefunction at force field-like efficiency and captures quantum mechanics in an analytically differentiable representation. On several examples, we demonstrate that this opens promising avenues to perform inverse design of molecular structures for target electronic property optimisation and a clear path towards increased synergy of machine learning and quantum chemistry.
Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models.
EN: We introduce a new molecular dataset, named Alchemy, for developing machine learning models useful in chemistry and material science. As of June 20th 2019, the dataset comprises 12 quantum mechanical properties of 119,487 organic molecules with up to 14 heavy atoms, sampled from the GDB MedChem database. The Alchemy dataset expands the volume and diversity of existing molecular datasets. Our extensive benchmarks of the state-of-the-art graph neural network models on Alchemy clearly demonstrate the usefulness of new data in validating and developing machine learning models for chemistry and material science. We further launch a contest to attract attention from researchers in the related fields. More details can be found on the contest website (https://alchemy.tencent.com). At the time of the benchmarking experiments, we had generated 119,487 molecules in our Alchemy dataset. More molecular samples have been generated since then. Hence, we provide a list of the molecules used in the reported benchmarks.
Graph Embedding on Biomedical Networks: Methods, Applications, and Evaluations.
EN: Graph embedding learning, which aims to automatically learn low-dimensional node representations, has drawn increasing attention in recent years. To date, most recent graph embedding methods have been evaluated on social and information networks and have not been comprehensively studied on biomedical networks under systematic experiments and analyses. On the other hand, for a variety of biomedical network analysis tasks, traditional techniques such as matrix factorization (which can be seen as a type of graph embedding method) have shown promising results, and hence there is a need to systematically evaluate the more recent graph embedding methods (e.g. random walk-based and neural network-based) in terms of their usability and potential to further the state-of-the-art. We select 11 representative graph embedding methods and conduct a systematic comparison on 3 important biomedical link prediction tasks: drug-disease association (DDA) prediction, drug-drug interaction (DDI) prediction, and protein-protein interaction (PPI) prediction; and 2 node classification tasks: medical term semantic type classification and protein function prediction. Our experimental results demonstrate that the recent graph ...
Vaccination strategies to control Ebola epidemics in the context of variable household inaccessibility levels.
EN: In the context of the ongoing Ebola epidemic in DRC, active conflict and community distrust are undermining control efforts, including vaccination strategies. In this paper, we employed an individual-level stochastic structured transmission model to assess the impact of vaccination strategies on epidemic control in the context of variable levels of household inaccessibility. We found that a ring vaccination strategy of close contacts would not be effective for containing the epidemic when there are significant delays in vaccinating contacts, even for low levels of household inaccessibility, and we evaluated the impact of a supplemental community vaccination strategy. For lower levels of inaccessibility, the probability of epidemic containment increases over time. For higher levels of inaccessibility, even the combined ring and community vaccination strategies are not expected to contain the epidemic, although they help lower incidence levels, which saves lives, makes the epidemic easier to contain and reduces spread to other communities. We found that ring vaccination is effective for containing an outbreak until the level of inaccessibility exceeds approximately 10%, a combined ri...
Deep Contextualized Biomedical Abbreviation Expansion.
EN: Automatic identification and expansion of ambiguous abbreviations are essential for biomedical natural language processing applications, such as information retrieval and question answering systems. In this paper, we present the DEep Contextualized Biomedical Abbreviation Expansion (DECBAE) model. DECBAE automatically collects substantial and relatively clean annotated contexts for 950 ambiguous abbreviations from PubMed abstracts using a simple heuristic. It then uses BioELMo to extract the contextualized features of words and feeds those features to abbreviation-specific bidirectional LSTMs, where the hidden states of the ambiguous abbreviations are used to assign the exact definitions. Our DECBAE model outperforms other baselines by large margins, achieving an average accuracy of 0.961 and a macro-F1 of 0.917 on the dataset. It also surpasses human performance for expanding a sample abbreviation, and remains robust in imbalanced, low-resource and clinical settings.
Natural Deep Eutectic Solvents as Agents for Improving Solubility, Stability and Delivery of Curcumin.
EN: Purpose: This study of curcumin dissolved in natural deep eutectic solvents (NADES) was aimed at exploiting their beneficial properties as drug carriers. Methods: The concentration of dissolved curcumin in NADES was measured. Simulated gastrointestinal fluids were used to determine the concentration of curcumin, and quantum chemistry computations were performed to clarify the origin of curcumin solubility enhancement in NADES. Results: NADES comprising choline chloride and glycerol had the highest potential for curcumin dissolution. This system was also successfully applied as an extraction medium for obtaining curcuminoids from natural sources, as well as an effective stabilizer preventing curcumin degradation by sunlight. The solubility of curcumin in simulated gastrointestinal fluids revealed that the significant increase in bioavailability takes place in the small intestinal fluid. Conclusions: Suspension of curcumin in NADES offers beneficial properties for this new liquid drug formulation, starting from extraction from natural sources, through safe storage, and ending with the final administration route. Therefore, there is a possibility of using a one-step process with this medium. Th...
Visual Diagnosis of Dermatological Disorders: Human and Machine Performance.
EN: Skin conditions are a global health concern, ranking as the fourth highest cause of nonfatal disease burden when measured as years lost due to disability. As diagnosing, or classifying, skin diseases can help determine effective treatment, dermatologists have extensively researched how to diagnose conditions from a patient's history and the lesion's visual appearance. Computer vision researchers are attempting to encode this diagnostic ability into machines, and several recent studies report machine performance comparable with that of dermatologists. This report reviews machine approaches to classifying skin images and considers their performance when compared to human dermatologists. Following an overview of common image modalities, dermatologists' diagnostic approaches and common tasks, and publicly available datasets, we discuss approaches to machine skin lesion classification. We then review works that directly compare human and machine performance. Finally, this report addresses the limitations and sources of errors in image-based skin disease diagnosis, applicable to both machines and dermatologists in a teledermatology setting.
Modelling double emulsion formation in planar flow-focusing microchannels.
EN: Double emulsion formation in a hierarchical flow-focusing channel is systematically investigated using a free energy ternary lattice Boltzmann model. A three dimensional formation regime diagram is constructed based on the capillary numbers of the inner ($Ca_i$), middle ($Ca_m$) and outer ($Ca_o$) phase fluids. The results show that the formation diagram can be classified into periodic two-step region, periodic one-step region, and non-periodic region. By varying $Ca_i$ and $Ca_m$ in the two-step formation region, different morphologies are obtained, including the regular double emulsions, decussate regimes with one or two alternate empty droplets, and structures with multiple inner droplets contained in the continuous middle phase thread. Bidisperse behaviors are also frequently encountered in the two-step formation region. In the periodic one-step formation region, scaling laws are proposed for the double emulsion size and for the size ratio between the inner droplet and the overall double emulsion. Furthermore, we show that the interfacial tension ratio can greatly change the morphologies of the obtained emulsion droplets, and the channel geometry plays an important role in chan...
Mathematical Discovery of Natural Laws in Biomedical Sciences: A New Methodology.
EN: As biomedical sciences discover new layers of complexity in the mechanisms of life and disease, mathematical models trying to catch up with these developments become mathematically intractable. As a result, in the grand scheme of things, mathematical models have so far played an auxiliary role in biomedical sciences. We propose a new methodology allowing mathematical modeling to give, in certain cases, definitive answers to systemic biomedical questions that elude empirical resolution. Our methodology is based on two ideas: (1) employing mathematical models that are firmly rooted in established biomedical knowledge yet so general that they can account for any, or at least many, biological mechanisms, both known and unknown; (2) finding model parameters whose likelihood-maximizing values are independent of observations (existence of such parameters implies that the model must not meet regularity conditions required for the consistency of maximum likelihood estimator). These universal parameter values may reveal general patterns (that we call natural laws) in biomedical processes. We illustrate this approach with the discovery of a clinically important natural law governing cancer me...
All SMILES Variational Autoencoder.
EN: Variational autoencoders (VAEs) defined over SMILES string and graph-based representations of molecules promise to improve the optimization of molecular properties, thereby revolutionizing the pharmaceuticals and materials industries. However, these VAEs are hindered by the non-unique nature of SMILES strings and the computational cost of graph convolutions. To efficiently pass messages along all paths through the molecular graph, we encode multiple SMILES strings of a single molecule using a set of stacked recurrent neural networks, pooling hidden representations of each atom between SMILES representations, and use attentional pooling to build a final fixed-length latent representation. By then decoding to a disjoint set of SMILES strings of the molecule, our All SMILES VAE learns an almost bijective mapping between molecules and latent representations near the high-probability-mass subspace of the prior. Our SMILES-derived but molecule-based latent representations significantly surpass the state-of-the-art in a variety of fully- and semi-supervised property regression and molecular property optimization tasks.
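The non-uniqueness of SMILES strings mentioned in this abstract can be illustrated with a small sketch. The writer below is a toy, not the paper's code: it assumes an acyclic molecule with single bonds and implicit hydrogens only, and the names `write_smiles`, `all_rooted_smiles`, `ETHANOL_SYMBOLS`, and `ETHANOL_ADJACENCY` are hypothetical. It shows that one molecular graph yields a different, equally valid string for each starting atom.

```python
# Toy SMILES writer: acyclic molecules, single bonds, implicit hydrogens only.
# Illustrative assumption, not a full SMILES implementation.

def write_smiles(symbols, adjacency, atom, parent=None):
    """Return a depth-first SMILES string for the tree rooted at `atom`."""
    children = [n for n in adjacency[atom] if n != parent]
    text = symbols[atom]
    # Every child except the last is written as a parenthesised branch.
    for child in children[:-1]:
        text += "(" + write_smiles(symbols, adjacency, child, atom) + ")"
    # The last child continues the main chain without parentheses.
    if children:
        text += write_smiles(symbols, adjacency, children[-1], atom)
    return text


def all_rooted_smiles(symbols, adjacency):
    """Return one SMILES string per choice of starting atom."""
    return [write_smiles(symbols, adjacency, start) for start in range(len(symbols))]


# Ethanol heavy atoms: C0 - C1 - O2
ETHANOL_SYMBOLS = ["C", "C", "O"]
ETHANOL_ADJACENCY = {0: [1], 1: [0, 2], 2: [1]}

if __name__ == "__main__":
    for smiles in all_rooted_smiles(ETHANOL_SYMBOLS, ETHANOL_ADJACENCY):
        print(smiles)  # CCO, C(C)O, OCC
```

All three strings describe ethanol, which is why a model trained on a single string per molecule sees only one of several equivalent traversals.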
Using Neural Networks for Relation Extraction from Biomedical Literature.
EN: Using different sources of information to support the automated extraction of relations between biomedical concepts contributes to the development of our understanding of biological systems. The primary comprehensive source of these relations is biomedical literature. Several relation extraction approaches have been proposed to identify relations between concepts in biomedical literature, notably those using neural network algorithms. The use of multichannel architectures composed of multiple data representations, as in deep neural networks, is leading to state-of-the-art results. The right combination of data representations can eventually lead to even higher evaluation scores in relation extraction tasks. Thus, biomedical ontologies play a fundamental role by providing semantic and ancestry information about an entity. The incorporation of biomedical ontologies has already been shown to enhance previous state-of-the-art results.
Limitations in Predicting Radiation-Induced Pharmaceutical Instability during Long-Duration Spaceflight.
EN: As human spaceflight seeks to expand beyond low-Earth orbit, NASA and its international partners face numerous challenges related to ensuring the safety of their astronauts, including the need to provide a safe and effective pharmacy for long-duration spaceflight. Historical missions have relied upon frequent resupply of onboard pharmaceuticals; as a result, there has been little study into the effects of long-term exposure of pharmaceuticals to the space environment. Of particular concern are the long-term effects of space radiation on drug stability, especially as missions venture away from the protective proximity of the Earth. Here we highlight the risk of space radiation to pharmaceuticals during exploration spaceflight, identifying the limitations of current understanding. We further seek to identify ways in which these limitations could be addressed through dedicated research efforts aimed towards the rapid development of an effective pharmacy for future spaceflight endeavors.
A Fuzzy Inference System for the Identification.
EN: Odor identification is an important area in a wide range of industries, such as cosmetics, food, beverages and medical diagnosis, among others. Odor detection can be done through an array of gas sensors arranged as an electronic nose, where a data acquisition module converts sensor signals to a standard output to be analyzed. To facilitate odor detection, a system is required for the identification. This paper presents the results of an automated odor identification process implemented by a fuzzy system and an electronic nose. First, an electronic nose prototype is manufactured to detect organic compound vapors using an array of five tin dioxide gas sensors; an Arduino Uno board is used as the data acquisition section. Second, an intelligent module with a fuzzy system is considered for the identification of the signals received by the electronic nose. This solution proposes a system to identify odors by using a personal computer. Results show an acceptable precision.
Comprehensive classification of the plant non-specific lipid transfer protein superfamily towards its Sequence -Structure -Function analysis.
EN: Background. Non-specific Lipid Transfer Proteins (nsLTPs) are widely distributed in the plant kingdom and constitute a superfamily of related proteins. More than 800 different sequences have been characterized so far, but their biological functions remain unclear. It has been clear for years that they present a certain interest for agronomic and nutritional issues. Deciphering their functions means collecting and analyzing a variety of data from gene sequence to protein structure, from cellular localization to the physiological role. As a huge and growing number of new protein sequences are available nowadays, extracting meaningful knowledge from sequence-structure-function relationships calls for the development of new tools and approaches. As nsLTPs show high evolutionary divergence, but a conserved common right-handed superhelix structural fold, and as they are involved in a large number of key roles in plant development and defense, they are a stimulating case study for validating such an approach.
Drug-Drug Adverse Effect Prediction with Graph Co-Attention.
EN: Complex or co-existing diseases are commonly treated using drug combinations, which can lead to higher risk of adverse side effects. The detection of polypharmacy side effects is usually done in Phase IV clinical trials, but there are still plenty which remain undiscovered when the drugs are put on the market. Such accidents have been affecting an increasing proportion of the population (15% in the US now) and it is thus of high interest to be able to predict the potential side effects as early as possible. Systematic combinatorial screening of possible drug-drug interactions (DDI) is challenging and expensive. However, the recent significant increases in data availability from pharmaceutical research and development efforts offer a novel paradigm for recovering relevant insights for DDI prediction. Accordingly, several recent approaches focus on curating massive DDI datasets (with millions of examples) and training machine learning models on them. Here we propose a neural network architecture able to set state-of-the-art results on this task---using the type of the side-effect and the molecular structure of the drugs alone---by leveraging a co-attentional mechanism. In particular,...
Molecular shape as a (useful) bias in chemistry.
EN: One of the molecular properties most intuitive to human perception is geometrical shape. However, when exploring a large chemical space, the determination of shape needs to be automated. We present a fast and simple approach to identify a molecule as linear, planar, cube, cuboid, disk, elliptical disk, spheroid or sphere, which is more fine-grained than existing approaches. The method is applied to more than one billion molecules ranging from small organic molecules to whole proteins. The results show that current chemistry research is biased towards planar geometries. Moreover, we demonstrate that our molecular shape classification correlates with sought-after properties like the band gap, dipole moment, and heat capacity. This makes it possible to increase the efficiency of molecular design studies by driving high-throughput screening efforts towards desired values of molecular properties.
Deep Neural Networks Ensemble for Detecting Medication Mentions in Tweets.
EN: Objective: After years of research, Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step to incorporating Twitter data in pharmacoepidemiological research is to automatically recognize medication mentions in tweets. Given that lexical searches for medication names may fail due to misspellings or ambiguity with common words, we propose a more advanced method to recognize them. Methods: We present Kusuri, an Ensemble Learning classifier, able to identify tweets mentioning drug products and dietary supplements. Kusuri ("medication" in Japanese) is composed of two modules. First, four different classifiers (lexicon-based, spelling-variant-based, pattern-based and one based on a weakly-trained neural network) are applied in parallel to discover tweets potentially containing medication names. Second, an ensemble of deep neural networks encoding morphological, semantical and long-range dependencies of important words in the tweets discovered is used to make the final decision. Results: On a balanced (50-50) corpus of 15,005 tweets, Kusuri demonstrated performances close to human annotators w...
Attention-based Multi-instance Neural Network for Medical Diagnosis from Incomplete and Low Quality Data.
EN: One way to extract patterns from clinical records is to consider each patient record as a bag with a varying number of instances in the form of symptoms. Medical diagnosis then amounts to discovering the informative instances first and mapping them to one or more diseases. In many cases, patients are represented as vectors in some feature space and a classifier is then applied to generate diagnosis results. However, in many real-world cases, data is of low quality for a variety of reasons, such as problems with data consistency, integrity, completeness, and accuracy. In this paper, we propose a novel approach, the attention-based multi-instance neural network (AMI-Net), to make single-disease classifications based only on the existing and valid information in real-world outpatient records. In the context of a patient, it takes a bag of instances as input and outputs the bag label directly in an end-to-end way. An embedding layer is adopted at the beginning, mapping instances into an embedding space which represents the individual patient condition. The correlations among instances and their importance for the final classification are captured by multi-head attention transformer, instance-level multi-instance po...
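The idea of weighting the instances in a bag by their importance can be sketched as a single attention-pooling step. This is a deliberate simplification and not AMI-Net itself: it assumes one fixed query vector and dot-product scores instead of the learned multi-head attention described above, and the names `softmax`, `attention_pool`, and `query` are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, query):
    """Pool a variable-size bag of instance vectors into one bag vector.

    Each instance is scored by its dot product with `query` (assumed here to
    be given; in a trained model it would be learned). The bag vector is the
    weighted average of the instances.
    """
    scores = [sum(x * q for x, q in zip(inst, query)) for inst in instances]
    weights = softmax(scores)
    dim = len(instances[0])
    pooled = [
        sum(w * inst[d] for w, inst in zip(weights, instances))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # A bag with three symptom embeddings (made-up numbers for illustration).
    bag = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    pooled, weights = attention_pool(bag, query=[2.0, 0.0])
    print(pooled, weights)
```

Because the weights are computed per bag, the same function handles records with any number of instances, which is the property that makes the bag formulation suitable for incomplete records.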
PyRod -- Tracing Water Molecules in Molecular Dynamics Simulations.
EN: Ligands entering a protein binding pocket essentially compete with water molecules for binding to the protein. Hence, the location and thermodynamic properties of water molecules in protein structures have gained increased attention in the drug design community. Including corresponding data in 3D pharmacophore modeling is essential for efficient high-throughput virtual screening. Here, we present PyRod, a free and open-source Python software package that allows for visualization of pharmacophoric binding pocket characteristics, identification of hot spots for ligand binding and subsequent generation of pharmacophore features for virtual screening. The implemented routines analyze the protein environment of water molecules in molecular dynamics (MD) simulations and can differentiate between hydrogen-bonded waters as well as waters in a protein environment of hydrophobic, charged or aromatic atom groups. The gathered information is further processed to generate dynamic molecular interaction fields (dMIFs) for visualization and pharmacophoric features for virtual screening. The described software was applied to 5 therapeutically relevant drug targets and generated pharmacophores were evalua...
Step Change Improvement in ADMET Prediction with PotentialNet Deep Featurization.
EN: The Absorption, Distribution, Metabolism, Elimination, and Toxicity (ADMET) properties of drug candidates are estimated to account for up to 50% of all clinical trial failures. Predicting ADMET properties has therefore been of great interest to the cheminformatics and medicinal chemistry communities in recent decades. Traditional cheminformatics approaches, whether the learner is a random forest or a deep neural network, leverage fixed fingerprint feature representations of molecules. In contrast, in this paper, we learn the features most relevant to each chemical task at hand by representing each molecule explicitly as a graph, where each node is an atom and each edge is a bond. By applying graph convolutions to this explicit molecular representation, we achieve, to our knowledge, unprecedented accuracy in prediction of ADMET properties. By challenging our methodology with rigorous cross-validation procedures and prospective analyses, we show that deep featurization better enables molecular predictors to not only interpolate but also extrapolate to new regions of chemical space.
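The molecule-as-graph representation described in this abstract (atoms as nodes, bonds as edges) can be illustrated with one round of neighbor aggregation. This is a generic sum-aggregation sketch under stated assumptions, not the PotentialNet architecture: it has no learned weights or nonlinearity, and the names `graph_convolution`, `ATOM_FEATURES`, and `BONDS` are hypothetical.

```python
def graph_convolution(features, edges):
    """One round of sum aggregation over an undirected molecular graph.

    Each atom's new feature vector is its own vector plus the vectors of its
    bonded neighbours. A real layer would also apply learned weights and a
    nonlinearity; both are omitted here for clarity.
    """
    neighbours = {i: [] for i in range(len(features))}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)

    updated = []
    for i, own in enumerate(features):
        total = list(own)
        for j in neighbours[i]:
            total = [t + f for t, f in zip(total, features[j])]
        updated.append(total)
    return updated


# Ethanol heavy atoms C0 - C1 - O2, with one-hot features [is_carbon, is_oxygen].
ATOM_FEATURES = [[1, 0], [1, 0], [0, 1]]
BONDS = [(0, 1), (1, 2)]

if __name__ == "__main__":
    print(graph_convolution(ATOM_FEATURES, BONDS))  # [[2, 0], [2, 1], [1, 1]]
```

After one round, the two carbon atoms already have different feature vectors because only one of them is bonded to oxygen. This is the sense in which features are learned from the graph rather than fixed in advance as a fingerprint.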
Learning Super-resolution 3D Segmentation of Plant Root MRI Images from Few Examples.
EN: Analyzing plant roots is crucial to understanding plant performance in different soil environments. While magnetic resonance imaging (MRI) can be used to obtain 3D images of plant roots, extracting the root structural model is challenging due to highly noisy soil environments and the low resolution of MRI images. To improve both contrast and resolution, we adapt the state-of-the-art method RefineNet for 3D segmentation of the plant root MRI images in super-resolution. The networks are trained from few manual segmentations that are augmented by geometric transformations, realistic noise, and other variabilities. The resulting segmentations contain most root structures, including branches not extracted by the human annotator.
ColourQuant: a high-throughput technique to extract and quantify colour phenotypes from plant images.
EN: Colour patterning contributes to important plant traits that influence ecological interactions, horticultural breeding, and agricultural performance. High-throughput phenotyping of colour is valuable for understanding plant biology and selecting for traits related to colour during plant breeding. Here we present ColourQuant, an automated high-throughput pipeline that allows users to extract colour phenotypes from images. This pipeline includes methods for colour phenotyping using mean pixel values, Gaussian density estimator of Lab colour, and the analysis of shape-independent colour patterning by circular deformation.
Geometry and kinetics determine the microstructure in arrested coalescence of Pickering emulsion droplets.
EN: An important strategy to stabilize emulsions is to arrest coalescence of the constituent droplets with an opposing rheological force. Colloidal particles adsorbed on the surface of emulsion droplets in a Pickering emulsion become increasingly crowded during successive coalescence events because the combined surface area of coalescing droplets is less than that of the constituent droplets. Beyond a critical density, the particles form a rigid shell around the droplet and inhibit both relaxation of the droplet shape and further coalescence. The resulting droplets have a nonuniform distribution of curvature and, depending on the initial coverage, may incorporate a region with negative Gaussian curvature around the neck that bridges the two droplets. Here, we resolve the relative influence of the curvature and the kinetic process of arrest on the microstructure of the final state. Identifying the dimensionless ratio of the rate of area change $\dot{A}$ to the diffusion constant $D$ as a measure of the importance of kinetics in this system, we show that this depends on the extrinsic geometry of the surface as opposed to the static packings that depend solely on intrinsic geometry.
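The crowding argument can be checked with a short worked example. It assumes two equal droplets of radius $R$ that merge at constant volume and relax fully to a sphere of radius $R'$; these idealizations are illustrative and not taken from the paper, where arrest prevents full relaxation.

```latex
\frac{4}{3}\pi R'^{3} = 2\cdot\frac{4}{3}\pi R^{3}
\;\Rightarrow\; R' = 2^{1/3} R,
\qquad
\frac{A_{\text{merged}}}{A_{\text{initial}}}
= \frac{4\pi R'^{2}}{2\cdot 4\pi R^{2}}
= \frac{2^{2/3}}{2}
= 2^{-1/3} \approx 0.794 .
```

Under these assumptions about 21% of the interface is lost per merger, so a fixed number of adsorbed particles must pack more densely on the merged droplet.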
MedMentions: A Large Biomedical Corpus Annotated with UMLS Concepts.
EN: This paper presents the formal release of MedMentions, a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines. In addition to the full corpus, a sub-corpus of MedMentions is also presented, comprising annotations for a subset of UMLS 2017 targeted towards document retrieval. To encourage research in Biomedical Named Entity Recognition and Linking, data splits for training and testing are included in the release, and a baseline model and its metrics for entity linking are also described.
Soft meets hard -- how does freeze-thaw cycling affect the microstructure of particle-stabilised emulsions?
EN: The freeze-thaw cycling of particle-stabilised emulsions can alter the emulsion structure and stability. This could have significant consequences for using particle stabilisation in industrial applications where increased stability is generally desirable. It is therefore important to characterise the behaviour and stability of these composites under the influence of freeze-thaw cycles. Water-in-oil Pickering emulsions stabilised by poly(methyl methacrylate) particles were subjected to freeze-thaw cycles of the continuous phase under two different conditions - uniform and non-uniform freezing. Confocal microscopy was used to study the emulsion behaviour and structure during these processes. The effect of droplet size and cooling rate on uniformly frozen emulsions was also considered. The final structure of the emulsion after a single freeze-thaw cycle is strongly dependent on the freezing method. Uniformly frozen emulsions show crumpled droplet structures, while non-uniformly frozen emulsions have a non-uniform structure containing foam-like regions not observed in uniform freezing. Droplet size has little effect on the final structure of uniformly frozen emulsions, which we attribu...
Deep learning algorithms out-perform veterinary pathologists in detecting the mitotically most active tumor region.
EN: Manual count of mitotic figures, which is determined in the tumor region with the highest mitotic activity, is a key parameter of most tumor grading schemes. It can be, however, strongly dependent on the area selection due to uneven mitotic figure distribution in the tumor section. We aimed to assess how significantly the area selection could impact the mitotic count, which has a known high inter-rater disagreement. On a data set of 32 whole slide images of H&E-stained canine cutaneous mast cell tumor, fully annotated for mitotic figures, we asked eight veterinary pathologists (five board-certified, three in training) to select a field of interest for the mitotic count. To assess the potential difference on the mitotic count, we compared the mitotic count of the selected regions to the overall distribution on the slide. Additionally, we evaluated three deep learning-based methods for the assessment of highest mitotic density: In one approach, the model would directly try to predict the mitotic count for the presented image patches as a regression task. The second method aims at deriving a segmentation mask for mitotic figures, which is then used to obtain a mitotic dens...
Skin Lesion Synthesis with Generative Adversarial Networks.
EN: Skin cancer is by far the most common type of cancer. Early detection is the key to significantly increasing the chances of successful treatment. Currently, Deep Neural Networks achieve the state-of-the-art results in automated skin cancer classification. To push the results further, we need to address the lack of annotated data, which is expensive to obtain and requires much effort from specialists. To bypass this problem, we propose using Generative Adversarial Networks for generating realistic synthetic skin lesion images. To the best of our knowledge, our results are the first to show visually-appealing synthetic images that comprise clinically-meaningful information.
Interactive molecular dynamics in virtual reality from quantum chemistry to drug binding: An open-source multi-person framework.
EN: As molecular scientists have made progress in their ability to engineer nano-scale molecular structure, we are facing new challenges in our ability to engineer molecular dynamics (MD) and flexibility. Dynamics at the molecular scale differs from the familiar mechanics of everyday objects, because it involves a complicated, highly correlated, and three-dimensional many-body dynamical choreography which is often non-intuitive even for highly trained researchers. We recently described how interactive molecular dynamics in virtual reality (iMD-VR) can help to meet this challenge, enabling researchers to manipulate real-time MD simulations of flexible structures in 3D. In this article, we outline efforts to extend immersive technologies to the molecular sciences, and we introduce 'Narupa', a flexible, open-source, multi-person iMD-VR software framework which enables groups of researchers to simultaneously cohabit real-time simulation environments to interactively visualize and manipulate the dynamics of molecular structures with atomic-level precision. We outline several application domains where iMD-VR is facilitating research, communication, and creative approaches within the molecula...
WideDTA: prediction of drug-target binding affinity.
EN: Motivation: Prediction of the interaction affinity between proteins and compounds is a major challenge in the drug discovery process. WideDTA is a deep-learning based prediction model that employs chemical and biological textual sequence information to predict binding affinity. Results: WideDTA uses four text-based information sources, namely the protein sequence, ligand SMILES, protein domains and motifs, and maximum common substructure words to predict binding affinity. WideDTA outperformed one of the state-of-the-art deep learning methods for drug-target binding affinity prediction, DeepDTA, on the KIBA dataset with statistical significance. This indicates that the word-based sequence representation adopted by WideDTA is a promising alternative to the character-based sequence representation approach in deep learning models for binding affinity prediction, such as the one used in DeepDTA. In addition, the results showed that, given the protein sequence and ligand SMILES, the inclusion of protein domain and motif information as well as ligand maximum common substructure words does not provide additional useful information for the deep learning model. Interestingly, however, using...
Bayesian active learning for optimization and uncertainty quantification in protein docking.
EN: Motivation: Ab initio protein docking represents a major challenge for optimizing a noisy and costly "black box"-like function in a high-dimensional space. Despite progress in this field, there is no docking method available for rigorous uncertainty quantification (UQ) of its solution quality (e.g. interface RMSD or iRMSD). Results: We introduce a novel algorithm, Bayesian Active Learning (BAL), for optimization and UQ of such black-box functions and flexible protein docking. BAL directly models the posterior distribution of the global optimum (or native structures for protein docking) with active sampling and posterior estimation iteratively feeding each other. Furthermore, we use complex normal modes to represent a homogeneous Euclidean conformation space suitable for high-dimension optimization and construct funnel-like energy models for encounter complexes. Over a protein docking benchmark set and a CAPRI set including homology docking, we establish that BAL significantly improves against both starting points by rigid docking and refinements by particle swarm optimization, providing for one third of targets a top-3 near-native prediction. BAL also generates tight confidence inter...
Electrorheology of a dilute emulsion of surfactant-covered drops.
EN: The effects of surfactant coating on a deformable viscous drop under the combined action of a shear flow and a uniform electric field are investigated by solving the coupled equations of electrostatics, fluid flow and surfactant transport. Employing a comprehensive three-dimensional solution technique, the non-Newtonian shearing response of the bulk emulsion is analyzed in the dilute suspension regime. The present results reveal that the surfactant non-uniformity creates significant alterations in the flow disturbance around the drop, thereby influencing the viscous dissipation from the flowing emulsion. This, in effect, triggers changes in the bulk shear viscosity. It is striking to observe that the balance between electrical and hydrodynamic stresses is affected in such a way that the surface tension gradient on the drop surface vanishes for some specific shear rates and the corresponding effective change in the bulk viscosity becomes negligible too. This critical condition depends strongly on the electrical permittivity and conductivity ratios of the two fluids and the orientation of the applied electric field. Also the physical mechanisms of charge convection of surface deformation play...
Dynamics of growth and form in prebiotic vesicles.
EN: The growth, form, and division of prebiotic vesicles, membranous bags of fluid of varying components and shapes, are hypothesized to have served as the substrate for the origin of life. The dynamics of these out-of-equilibrium structures is controlled by physicochemical processes that include the intercalation of amphiphiles into the membrane, fluid flow across the membrane, and elastic deformations of the membrane. To understand prebiotic vesicular forms and their dynamics, we construct a minimal model that couples membrane growth, deformation, and fluid permeation, ultimately couched in terms of two dimensionless parameters that characterize the relative rate of membrane growth and the membrane permeability. Numerical simulations show that our model captures the morphological diversity seen in extant precursor mimics of cellular life, and might provide simple guidelines for the synthesis of these complex shapes from simple ingredients.
Application of Multivariate Adaptive Regression Splines (MARSplines) for Predicting Hansen Solubility Parameters Based on 1D and 2D Molecular Descriptors Computed from SMILES String.
EN: A new method of Hansen solubility parameters (HSPs) prediction was developed by combining the multivariate adaptive regression splines (MARSplines) methodology with a simple multivariable regression involving 1D and 2D PaDEL molecular descriptors. In order to adapt the MARSplines approach to QSPR/QSAR problems, several optimization procedures were proposed and tested. The effectiveness of the obtained models was checked via standard QSPR/QSAR internal validation procedures provided by the QSARINS software and by predicting the solubility classification of polymers and drug-like solid solutes in collections of solvents. By utilizing information derived only from SMILES strings, the obtained models allow for computing all three Hansen solubility parameters: dispersion, polarization, and hydrogen bonding. Although several descriptors are required for proper parameter estimation, the proposed procedure is simple and straightforward and does not require a molecular geometry optimization. The obtained HSP values are highly correlated with experimental data, and their application for solving solubility problems leads to essentially the same quality as for the original par...
Internal conversion and intersystem crossing pathways in UV excited, isolated uracils and their implications in prebiotic chemistry.
EN: The photodynamic properties of molecules determine their ability to survive in harsh radiation environments. As such, the photostability of heterocyclic aromatic compounds to electromagnetic radiation is expected to have been one of the selection pressures influencing the prebiotic chemistry on early Earth. In the present study, the gas-phase photodynamics of uracil, 5-methyluracil (thymine) and 2-thiouracil -- three heterocyclic compounds thought to be present during this era -- are assessed in the context of their recently proposed intersystem crossing pathways that compete with internal conversion to the ground state. Specifically, time-resolved photoelectron spectroscopy measurements evidence femtosecond to picosecond timescales for relaxation of the singlet $^1\pi\pi^*$ and $^1n\pi^*$ states as well as for intersystem crossing to the triplet manifold. Trapping in the excited triplet state and intersystem crossing back to the ground state are investigated as potential factors contributing to the susceptibility of these molecules to ultraviolet photodamage.
Drug cell line interaction prediction.
EN: Understanding the phenotypic drug response on cancer cell lines plays a vital role in anti-cancer drug discovery and re-purposing. The Genomics of Drug Sensitivity in Cancer (GDSC) database provides open data for researchers in phenotypic screening to test their models and methods. Previously, most research in these areas started from the fingerprints or features of drugs, instead of their structures. In this paper, we introduce a model for phenotypic screening, which is called twin Convolutional Neural Network for drugs in SMILES format (tCNNS). tCNNS is comprised of CNN input channels for drugs in SMILES format and cancer cell lines respectively. Our model achieves $0.84$ for the coefficient of determination ($R^2$) and $0.92$ for Pearson correlation ($R_p$), which are significantly better than previous works \cite{ammad2014integrative,haider2015copula,menden2013machine}. Besides these statistical metrics, tCNNS also provides some insights into phenotypic screening.
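The two reported metrics measure different things, which a small sketch can make concrete. The functions below use the standard textbook definitions of $R^2$ and Pearson correlation; the toy values are invented for illustration and are not GDSC data or results from the paper.

```python
import math

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy values (hypothetical): predictions are perfectly correlated with the
# targets but systematically twice as large.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 4.0, 6.0, 8.0]
rp = pearson(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
```

Here the Pearson correlation is 1 while $R^2$ is negative, because $R^2$ penalizes errors in scale and offset whereas correlation does not. This is one reason papers often report both.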
CNN based Multi-Instance Multi-Task Learning for Syndrome Differentiation of Diabetic Patients.
EN: Syndrome differentiation in Traditional Chinese Medicine (TCM) is the process of understanding and reasoning about body condition, which is the essential step and premise of effective treatments. However, due to its complexity and lack of standardization, it is challenging to achieve. In this study, we consider each patient's record as a one-dimensional image and symptoms as pixels, in which missing and negative values are represented by zero pixels. The objective is to find relevant symptoms first and then map them to proper syndromes, which is similar to the object detection problem in computer vision. Inspired by this, we employ multi-instance multi-task learning combined with the convolutional neural network (MIMT-CNN) for syndrome differentiation, which takes region proposals as input and outputs image labels directly. The neural network consists of region proposals generation, convolutional layer, fully connected layer, and max pooling (multi-instance pooling) layer followed by the sigmoid function in each syndrome prediction task for image representation learning and final results generation. On the diabetes dataset, it performs better than all other baseline methods. Moreover, it s...
Non-local U-Net for Biomedical Image Segmentation.
EN: Deep learning has shown its great promise in various biomedical image segmentation tasks. Existing models are typically based on U-Net and rely on an encoder-decoder architecture with stacked local operators to aggregate long-range information gradually. However, only using the local operators limits the efficiency and effectiveness. In this work, we propose the non-local U-Nets, which are equipped with flexible global aggregation blocks, for biomedical image segmentation. These blocks can be inserted into U-Net as size-preserving processes, as well as down-sampling and up-sampling layers. We perform thorough experiments on the 3D multimodality isointense infant brain MR image segmentation task to evaluate the non-local U-Nets. Results show that our proposed models achieve top performances with fewer parameters and faster computation.
PSICA: decision trees for probabilistic subgroup identification with categorical treatments.
EN: Personalized medicine aims at identifying the best treatments for a patient with given characteristics. It has been shown in the literature that these methods can lead to great improvements in medicine compared to traditional methods prescribing the same treatment to all patients. Subgroup identification is a branch of personalized medicine which aims at finding subgroups of the patients with similar characteristics for which some of the investigated treatments have a better effect than the other treatments. A number of approaches based on decision trees have been proposed to identify such subgroups, but most of them focus on the two-arm trials (control/treatment) while a few methods consider quantitative treatments (defined by the dose). However, no subgroup identification method exists that can predict the best treatments in a scenario with a categorical set of treatments. We propose a novel method for subgroup identification in categorical treatment scenarios. This method outputs a decision tree showing the probabilities of a given treatment being the best for a given group of patients as well as labels showing the possible best treatments. The method is implemented in an R package ...
Dose finding for new vaccines: the role for immunostimulation/immunodynamic modelling.
EN: Current methods to optimize vaccine dose are purely empirically based, whereas in the drug development field, dosing determinations use far more advanced quantitative methodology to accelerate decision-making. Applying these established methods in the field of vaccine development may reduce the currently large clinical trial sample sizes, long time frames, high costs, and ultimately have a better potential to save lives. We propose the field of immunostimulation/immunodynamic (IS/ID) modelling, which aims to translate mathematical frameworks used for drug dosing towards optimizing vaccine dose decision-making. Analogous to PK/PD modelling, IS/ID modelling approaches apply mathematical models to describe the underlying mechanisms by which the immune response is stimulated by vaccination (IS) and the resulting measured immune response dynamics (ID). To move IS/ID modelling forward, existing datasets and further data on vaccine allometry and dose-dependent dynamics need to be generated and collated, requiring a collaborative environment with input from academia, industry, regulators, governmental and non-governmental agencies to share modelling expertise, and connect modellers to vacci...
Time Series Classification to Improve Poultry Welfare.
EN: Poultry farms are an important contributor to the human food chain. Worldwide, humankind keeps an enormous number of domesticated birds (e.g. chickens) for their eggs and their meat, providing rich sources of low-fat protein. However, around the world, there have been growing concerns about the quality of life for the livestock in poultry farms; and increasingly vocal demands for improved standards of animal welfare. Recent advances in sensing technologies and machine learning allow the possibility of automatically assessing the health of some individual birds, and employing the lessons learned to improve the welfare for all birds. This task superficially appears to be easy, given the dramatic progress in recent years in classifying human behaviors, and given that human behaviors are presumably more complex. However, as we shall demonstrate, classifying chicken behaviors poses several unique challenges, chief among which is creating a generalizable dictionary of behaviors from sparse and noisy data. In this work we introduce a novel time series dictionary learning algorithm that can robustly learn from weakly labeled data sources.
Prototypical Clustering Networks for Dermatological Disease Diagnosis.
EN: We consider the problem of image classification for the purpose of aiding doctors in dermatological diagnosis. Dermatological diagnosis poses two major challenges for standard off-the-shelf techniques: First, the data distribution is typically extremely long tailed. Second, intra-class variability is often large. To address the first issue, we formulate the problem as low-shot learning, where once deployed, a base classifier must rapidly generalize to diagnose novel conditions given very few labeled examples. To model diverse classes effectively, we propose Prototypical Clustering Networks (PCN), an extension to Prototypical Networks that learns a mixture of prototypes for each class. Prototypes are initialized for each class via clustering and refined via an online update scheme. Classification is performed by measuring similarity to a weighted combination of prototypes within a class, where the weights are the inferred cluster responsibilities. We demonstrate the strengths of our approach in effective diagnosis on a realistic dataset of dermatological conditions.
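The classification rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes squared Euclidean distance as the similarity measure and a softmax over negative distances as the cluster responsibilities, and the class labels and prototype coordinates are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_prototype(query, prototypes):
    """Combine a class's prototypes, weighted by assumed responsibilities
    (softmax over negative squared distance to the query)."""
    weights = softmax([-sq_dist(query, p) for p in prototypes])
    return [sum(w * p[k] for w, p in zip(weights, prototypes))
            for k in range(len(query))]

def classify(query, prototypes_by_class):
    """Pick the class whose weighted prototype is closest to the query."""
    return min(
        prototypes_by_class,
        key=lambda label: sq_dist(
            query, class_prototype(query, prototypes_by_class[label])
        ),
    )

# Hypothetical 2-D embedding space with two prototypes per class.
prototypes_by_class = {
    "class_a": [[0.0, 0.0], [4.0, 0.0]],
    "class_b": [[0.0, 5.0], [5.0, 5.0]],
}
label = classify([3.8, 0.2], prototypes_by_class)
```

Using several prototypes per class lets a class with large intra-class variability occupy more than one region of the embedding space, which a single mean prototype cannot represent.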
Generating equilibrium molecules with deep neural networks.
EN: Discovery of atomistic systems with desirable properties is a major challenge in chemistry and material science. Here we introduce a novel, autoregressive, convolutional deep neural network architecture that generates molecular equilibrium structures by sequentially placing atoms in three-dimensional space. The model estimates the joint probability over molecular configurations with tractable conditional probabilities which only depend on distances between atoms and their nuclear charges. It combines concepts from state-of-the-art atomistic neural networks with auto-regressive generative models for images and speech. We demonstrate that the architecture is capable of generating molecules close to equilibrium for constitutional isomers of C$_7$O$_2$H$_{10}$.
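The autoregressive idea can be written as the usual chain-rule factorization. The notation here is assumed for illustration: $Z_i$ is the nuclear charge and $\mathbf{r}_i$ the position of the $i$-th atom placed; the paper's own parameterization of each conditional is not reproduced.

```latex
p(Z_1,\mathbf{r}_1,\dots,Z_n,\mathbf{r}_n)
= \prod_{i=1}^{n} p\!\left(Z_i,\mathbf{r}_i \,\middle|\, Z_1,\mathbf{r}_1,\dots,Z_{i-1},\mathbf{r}_{i-1}\right)
```

Sampling then proceeds one atom at a time, each drawn from a conditional that depends only on the atoms already placed.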
Double emulsion drop evaporation and resurfacing of daughter droplet.
EN: In this study, we present experimental and theoretical analyses of double emulsion drop evaporation. After the apparent completion of evaporation of the inner phase of a double emulsion drop, surprisingly, a resurfacing of a daughter droplet is observed. Further investigation to explain this phenomenon allowed us to obtain prolonged fixed-contact-line evaporation for a single-phase drop, along with a resurfacing similar to that seen in double emulsion drops.
BioSentVec: creating sentence embeddings for biomedical texts.
EN: Sentence embeddings have become an essential part of today's natural language processing (NLP) systems, especially together with advanced deep learning methods. Although pre-trained sentence encoders are available in the general domain, none exists for biomedical texts to date. In this work, we introduce BioSentVec: the first open set of sentence embeddings trained with over 30 million documents from both scholarly articles in PubMed and clinical notes in the MIMIC-III Clinical Database. We evaluate BioSentVec embeddings in two sentence pair similarity tasks in different text genres. Our benchmarking results demonstrate that the BioSentVec embeddings can better capture sentence semantics compared to the other competitive alternatives and achieve state-of-the-art performance in both tasks. We expect BioSentVec to facilitate the research and development in biomedical text mining and to complement the existing resources in biomedical word embeddings. BioSentVec is publicly available at https://github.com/ncbi-nlp/BioSentVec
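Sentence pair similarity with embeddings is commonly scored by cosine similarity, sketched below. The vectors are made-up stand-ins for sentence embeddings; this sketch does not load or call BioSentVec, and the abstract does not state which similarity measure the authors used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-D "sentence embeddings" (real embeddings are much larger).
same_direction = cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Scores near 1 indicate embeddings pointing the same way and scores near 0 indicate unrelated directions.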
Viscosity of protein-stabilised emulsions: contributions of components and development of a semi-predictive model.
EN: Protein-stabilised emulsions can be seen as mixtures of unadsorbed proteins and of protein-stabilised droplets. To identify the contributions of these two components to the overall viscosity of sodium caseinate o/w emulsions, the rheological behaviour of pure suspensions of proteins and droplets was characterised, and their properties used to model the behaviour of their mixtures. These materials are conveniently studied in the framework developed for soft colloids. Here, the use of viscosity models for the two types of pure suspensions facilitates the development of a semi-empirical model that relates the viscosity of protein-stabilised emulsions to their composition.
Pickering emulsions with alpha-cyclodextrin inclusions: Structure and thermal stability.
EN: This paper explores structural, interfacial and thermal properties of two types of Pickering emulsions containing alpha-cyclodextrin inclusion complexes: on one hand, emulsions were obtained between aqueous solutions of alpha-cyclodextrin and different oils (fatty acids, olive oil, silicone oil) and on the other hand, emulsions were obtained between these oils, water and micro or nano-platelet suspensions with inclusion complexes of hydrophobically-modified polysaccharides. The emulsions exhibit versatile properties according to the molecular architecture of the oils. Experiments were performed by microcalorimetry, X-ray diffraction and confocal microscopy. The aptitude of oil molecules to be threaded in the alpha-cyclodextrin cavity is a determining parameter in emulsification and thermal stability. The heat flow traces and images showed dissolution, cooperative melting and de-threading of inclusion complexes which take place progressively, ending at high temperatures, close to or above 100°C. Another important feature observed in the emulsions with micro-platelets is the partial substitution of the guest molecules occurring at room temperature at the oil/water interfaces without dissol...
Skinny emulsions take on granular matter.
EN: Our understanding of the structural features of foams and emulsions has advanced significantly over the last 20 years. However, with a search for "super-stable" liquid dispersions, foam and emulsion science employs increasingly complex formulations which create solid-like visco-elastic layers at the bubble/drop surfaces. These lead to elastic, adhesive and frictional forces between bubbles/drops, impacting strongly how they pack and deform against each other, asking for an adaptation of the currently available structural description. The possibility to modify systematically the interfacial properties makes these dispersions ideal systems for the exploration of soft granular materials with complex interactions. We present here a first systematic analysis of the structural features of such a system using a model silicone emulsion containing millimetre-sized polyethylene glycol drops (PEG). Solid-like drop surfaces are obtained by polymeric cross-linking reactions at the PEG-silicone interface. Using a novel droplet-micromanipulator, we highlight the presence of elastic, adhesive and frictional interactions between two drops. We then provide for the first time a full tomographic ana...
Optimal vaccine allocation during the mumps outbreak in two SIR centers.
EN: The aim of this work is to investigate the optimal vaccine sharing between two SIR centers in the presence of migration fluxes of susceptibles and infected individuals during the mumps outbreak. Optimality of the vaccine allocation means the minimization of the total number of lost working days during the whole period of epidemic outbreak $[0,t_f]$, which can be described by the functional $Q=\int_0^{t_f}I(t){\rm d}t$ where $I(t)$ stands for the number of infectives at time $t$. We explain the behavior of the optimal allocation, which depends on the model parameters and the amount of available vaccine.
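Illustrative sketch (not taken from the paper): the functional $Q=\int_0^{t_f}I(t){\rm d}t$ can be approximated numerically for a toy pair of coupled SIR centers. Everything specific below is an assumption made for illustration: populations are expressed as fractions, migration is a symmetric exchange at rate `m`, all vaccine is given at time 0, integration uses forward Euler, and the function and parameter names (`lost_days`, `best_split`, `split`) are hypothetical.

```python
def lost_days(split, total_vaccine=0.3, beta=0.5, gamma=0.1, m=0.01,
              s0=(0.99, 0.95), i0=(0.01, 0.05), t_final=100.0, dt=0.01):
    """Approximate Q = integral of (I1 + I2) over [0, t_final].

    `split` is the fraction of the vaccine stock sent to center 0 (assumed
    to be administered at t = 0, removing susceptibles immediately).
    """
    doses = (split * total_vaccine, (1.0 - split) * total_vaccine)
    s = [max(s0[k] - doses[k], 0.0) for k in range(2)]
    i = list(i0)
    q = 0.0
    for _ in range(int(round(t_final / dt))):
        q += (i[0] + i[1]) * dt  # left Riemann sum for the integral
        new_s, new_i = [], []
        for k in range(2):
            other = 1 - k
            infection = beta * s[k] * i[k]
            # Symmetric migration of susceptibles and infectives (assumed form).
            ds = -infection + m * (s[other] - s[k])
            di = infection - gamma * i[k] + m * (i[other] - i[k])
            new_s.append(s[k] + dt * ds)
            new_i.append(i[k] + dt * di)
        s, i = new_s, new_i
    return q


def best_split(n_grid=21, **kwargs):
    """Grid search over the split that minimises Q in this toy model."""
    candidates = [j / (n_grid - 1) for j in range(n_grid)]
    return min(candidates, key=lambda x: lost_days(x, **kwargs))
```

A grid search is used here only because it is easy to read; the paper's own optimisation method is not described in the abstract.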
Deep learning for in vitro prediction of pharmaceutical formulations.
EN: Current pharmaceutical formulation development still relies strongly on the traditional trial-and-error approach based on the individual experience of pharmaceutical scientists, which is laborious, time-consuming and costly. Recently, deep learning has been widely applied in many challenging domains because of its important capability of automatic feature extraction. The aim of this research is to use deep learning to predict pharmaceutical formulations. In this paper, two different types of dosage forms were chosen as model systems. Evaluation criteria suitable for pharmaceutics were applied to assess the performance of the models. Moreover, an automatic dataset selection algorithm was developed for selecting the representative data as validation and test datasets. Six machine learning methods were compared with deep learning. The results show that the accuracies of both deep neural networks were above 80% and higher than those of the other machine learning models, indicating good prediction of pharmaceutical formulations. In summary, deep learning with the automatic data splitting algorithm and the evaluation criteria suitable for pharmaceutical formulation data was firstly developed for the predi...
Texture changes during thermal processing of food: experiments and modelling.
EN: Texture is an important attribute in the quality assessment of processed food products. Young's modulus is an indirect measure of texture. During thermal treatment of hygroscopic foods, parameters such as moisture content significantly affect Young's modulus. However, the sensitivity to these parameters has not yet been quantified in terms of the stress-strain behaviour. We have built an experimentally validated model to address this gap. This paper presents the stress-strain behaviour and its sensitivity towards various parameters. Experiments are conducted with potato samples for stress-strain behaviour, parametric sensitivity analysis, and estimation of initial and critical values of moisture content and Young's modulus. We found that the Young's modulus and the ultimate strength vary by as much as 54 percent and 29 percent, respectively, depending on the rate of applied strain, indicating the need for test standards. Further, we propose a model to predict the local Young's moduli as a function of moisture content, and a relationship between these and the effective Young's modulus. While model results agree well for drying, they deviate by as much as 16 percent from experiments for frying, indicating th...
Comparative study of Discrete Wavelet Transforms and Wavelet Tensor Train decomposition to feature extraction of FTIR data of medicinal plants.
EN: Fourier-transform infrared (FTIR) spectra of samples from 7 plant species were used to explore the influence of preprocessing and feature extraction on the efficiency of machine learning algorithms. Wavelet Tensor Train (WTT) and Discrete Wavelet Transforms (DWT) were compared as feature extraction techniques for FTIR data of medicinal plants. Various combinations of signal processing steps showed different behavior when applied to classification and clustering tasks. The best results for WTT and DWT found through grid search were similar, significantly improving the quality of clustering as well as the classification accuracy for tuned logistic regression in comparison to the original spectra. Unlike DWT, WTT has only one parameter to be tuned (rank), making it a more versatile and easier-to-use data processing tool in various signal processing applications.
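Illustrative sketch (not from the paper): a discrete wavelet transform used as a feature extractor can be shown with the simplest wavelet, the orthonormal Haar wavelet. The choice of Haar, the number of levels, and the function names `haar_step` and `haar_features` are assumptions for illustration; the abstract does not say which wavelet families or settings were searched.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar DWT on an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    # Pairwise scaled sums give the coarse approximation,
    # pairwise scaled differences give the detail coefficients.
    approx = [(signal[i] + signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_features(spectrum, levels):
    """Apply `levels` Haar steps; return the final approximation and all details."""
    approx = list(spectrum)
    details = []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details
```

In a pipeline like the one described, the approximation coefficients (optionally with some detail levels) would replace the raw spectrum as classifier input; the spectrum length must be divisible by 2 raised to the number of levels.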
Anomaly Detection for Skin Disease Images Using Variational Autoencoder.
EN: In this paper, we demonstrate the potential of applying Variational Autoencoder (VAE) [10] for anomaly detection in skin disease images. VAE is a class of deep generative models which is trained by maximizing the evidence lower bound of data distribution [10]. When trained on only normal data, the resulting model is able to perform efficient inference and to determine if a test image is normal or not. We perform experiments on ISIC2018 Challenge Disease Classification dataset (Task 3) and compare different methods to use VAE to detect anomaly. The model is able to detect all diseases with 0.779 AUCROC. If we focus on specific diseases, the model is able to detect melanoma with 0.864 AUCROC and detect actinic keratosis with 0.872 AUCROC, even if it only sees the images of nevus. To the best of our knowledge, this is the first applied work of deep generative models for anomaly detection in dermatology.
chemmodlab: A Cheminformatics Modeling Laboratory for Fitting and Assessing Machine Learning Models.
EN: The goal of chemmodlab is to streamline the fitting and assessment pipeline for many machine learning models in R, making it easy for researchers to compare the utility of new models. While focused on implementing methods for model fitting and assessment that have been accepted by experts in the cheminformatics field, all of the methods in chemmodlab have broad utility for the machine learning community. chemmodlab contains several assessment utilities including a plotting function that constructs accumulation curves and a function that computes many performance measures. The most novel feature of chemmodlab is the ease with which statistically significant performance differences for many machine learning models are presented by means of the multiple comparisons similarity plot. Differences are assessed using repeated k-fold cross validation where blocking increases precision and multiplicity adjustments are applied.
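Illustrative sketch (in Python rather than R, and not chemmodlab's API): repeated k-fold cross validation generates several independent k-fold partitions of the same data. Evaluating every model on the same partitions is one way to read the "blocking" mentioned above, so that differences between models are compared split by split; that reading, and the name `repeated_kfold_indices`, are assumptions.

```python
import random


def repeated_kfold_indices(n_samples, k=5, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for each repeat
        folds = [order[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [idx for g in range(k) if g != f for idx in folds[g]]
            yield repeat, f, train, test
```

Because the generator is seeded, calling it again with the same arguments reproduces the same splits, which lets each candidate model be scored on identical folds.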
Physics of Active Emulsions.
EN: Phase separating systems that are maintained away from thermodynamic equilibrium via molecular processes represent a class of active systems, which we call active emulsions. These systems are driven by external energy input, for example provided by an external fuel reservoir. The external energy input gives rise to novel phenomena that are not present in passive systems. For instance, concentration gradients can spatially organise emulsions and cause novel droplet size distributions. Another example is active droplets that are subject to chemical reactions such that their nucleation and size can be controlled and they can spontaneously divide. In this review we discuss the physics of phase separation and emulsions and show how the concepts that govern such phenomena can be extended to capture the physics of active emulsions. This physics is relevant to the spatial organisation of the biochemistry in living cells, to the development of novel applications in chemical engineering, and to models for the origin of life.
Improving Chemical Autoencoder Latent Space and Molecular De novo Generation Diversity with Heteroencoders.
EN: Chemical autoencoders are attractive models as they combine chemical space navigation with possibilities for de novo molecule generation in areas of interest. This enables them to produce focused chemical libraries around a single lead compound for employment early in a drug discovery project. Here it is shown that the choice of chemical representation, such as SMILES strings, has a large influence on the properties of the latent space. It is further explored to what extent translating between different chemical representations influences the latent space similarity to the SMILES strings or circular fingerprints. By employing SMILES enumeration for either the encoder or decoder, it is found that the decoder has the largest influence on the properties of the latent space. Training a sequence-to-sequence heteroencoder based on recurrent neural networks (RNNs) with long short-term memory (LSTM) cells to predict different enumerated SMILES strings from the same canonical SMILES string gives the largest similarity between latent space distance and molecular similarity measured as circular fingerprint similarity. Using the output from the bottleneck in QSAR modelling of five molecular da...
N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules.
EN: Machine learning techniques have recently been adopted in various applications in medicine, biology, chemistry, and material engineering. An important task is to predict the properties of molecules, which serves as the main subroutine in many downstream applications such as virtual screening and drug design. Despite the increasing interest, the key challenge is to construct proper representations of molecules for learning algorithms. This paper introduces the N-gram graph, a simple unsupervised representation for molecules. The method first embeds the vertices in the molecule graph. It then constructs a compact representation for the graph by assembling the vertex embeddings in short walks in the graph, which we show is equivalent to a simple graph neural network that needs no training. The representations can thus be efficiently computed and then used with supervised learning methods for prediction. Experiments on 60 tasks from 10 benchmark datasets demonstrate its advantages over both popular graph neural networks and traditional representation methods. This is complemented by theoretical analysis showing its strong representation and prediction power.
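Illustrative sketch (a simplified reading of the abstract, not the authors' code): one way to "assemble vertex embeddings in short walks" is to take the elementwise product of the embeddings along each walk and sum over all walks of a given length, which can be computed with a training-free message-passing recursion. The product-and-sum choice, the adjacency-matrix input, and the name `ngram_graph_embedding` are assumptions.

```python
def ngram_graph_embedding(adjacency, vertex_embeddings, max_walk_length=3):
    """Concatenate, for n = 1..max_walk_length, the sum over all walks with
    n vertices of the elementwise product of the vertex embeddings on the walk.

    `adjacency` is a square 0/1 matrix (list of lists); walks are directed, so
    each undirected edge contributes in both directions.
    """
    n = len(adjacency)
    dim = len(vertex_embeddings[0])
    # state[v] aggregates all walks of the current length that end at vertex v.
    state = [list(vec) for vec in vertex_embeddings]
    features = []
    for length in range(1, max_walk_length + 1):
        if length > 1:
            new_state = []
            for v in range(n):
                acc = [0.0] * dim
                for u in range(n):
                    if adjacency[u][v]:
                        for d in range(dim):
                            acc[d] += state[u][d]
                # Extend every walk arriving at v by v's own embedding.
                new_state.append([acc[d] * vertex_embeddings[v][d] for d in range(dim)])
            state = new_state
        features.extend(sum(state[v][d] for v in range(n)) for d in range(dim))
    return features
```

With all embeddings set to 1, the features reduce to walk counts, which is a convenient sanity check; the resulting fixed-length vector could then be passed to any supervised learner.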
ToxicBlend: Virtual Screening of Toxic Compounds with Ensemble Predictors.
EN: Timely assessment of compound toxicity is one of the biggest challenges facing the pharmaceutical industry today. A significant proportion of compounds identified as potential leads are ultimately discarded due to the toxicity they induce. In this paper, we propose a novel machine learning approach for the prediction of molecular activity on ToxCast targets. We combine extreme gradient boosting with fully-connected and graph-convolutional neural network architectures trained on QSAR physical molecular property descriptors, PubChem molecular fingerprints, and SMILES sequences. Our ensemble predictor leverages the strengths of each individual technique, significantly outperforming existing state-of-the-art models on the ToxCast and Tox21 toxicity-prediction datasets. We provide free access to molecule toxicity prediction using our model at http://www.owkin.com/toxicblend.
Chemical Oscillation in Ultracold Chemistry.
EN: We demonstrate the occurrence of oscillatory reactions in the ultra-cold chemistry of an atom-molecular Bose-Einstein condensate. Nonlinear oscillations in the mean-field dynamics occur for a specific range of elliptic modulus, giving rise to both in- and out-phase modulations in the atom-molecule population density. The reaction front velocity is found to be controlled by photoassociation, which also regulates the condensate density. Two distinct pairs of in-phase bright localized gap solitons are found as exact solutions, the existence of one of which necessarily requires a background. Cnoidal atomic density-waves in a plane wave molecular background are observed in both attractive and repulsive domains. The role of intra- and inter-species interactions in both existence and stability is explicated in the presence of photoassociation.
Impact of group management and transfer on individual sociality in Highland cattle (Bos taurus).
EN: The sociality of cattle facilitates the maintenance of herd cohesion and synchronisation, making these species the ideal choice for domestication as livestock for humans. However, livestock populations are not self-regulated, and farmers transfer individuals across different groups throughout their lives for reasons such as genetic mixing, reproduction and pastureland management. Individuals consequently have to adapt to different group compositions during their lives rather than choose their own herd mates, as they would do in the wild. These changes may lead to social instability and stress, entailing potentially negative effects on animal welfare. In this study, we assess how the transfer of Highland cattle (Bos taurus) impacts individual and group social network measures. Four groups with nine different compositions and 18 individual transfers were studied to evaluate 1) the effect of group composition on individual social centralities and 2) the effect of group composition changes on these centralities. As shown in previous works, dyadic associations are stronger between individuals with similar age and dominance rank. This study reveals that the relative stability of dyadic s...
Automated Detection of Adverse Drug Reactions in the Biomedical Literature Using Convolutional Neural Networks and Biomedical Word Embeddings.
EN: Monitoring the biomedical literature for cases of Adverse Drug Reactions (ADRs) is a critically important and time-consuming task in pharmacovigilance. The development of computer-assisted approaches to aid this process in different forms has been the subject of many recent works. One particular area that has shown promise is the use of Deep Neural Networks, in particular, Convolutional Neural Networks (CNNs), for the detection of ADR-relevant sentences. Using token-level convolutions and general-purpose word embeddings, this architecture has shown good performance relative to more traditional models as well as Long Short Term Memory (LSTM) models. In this work, we evaluate and compare two different CNN architectures using the ADE corpus. In addition, we show that by de-duplicating the ADR-relevant sentences, we can greatly reduce overoptimism in the classification results. Finally, we evaluate the use of word embeddings specifically developed for biomedical text and show that they lead to better performance in this task.
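Illustrative sketch (not the authors' procedure): de-duplicating labelled sentences before splitting into train and test sets prevents the same sentence from appearing on both sides, which is one plausible source of the overoptimism mentioned above. The normalisation rule (lowercasing and collapsing whitespace) and the names `normalise` and `deduplicate` are assumptions.

```python
import re


def normalise(sentence):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", sentence.strip().lower())


def deduplicate(labelled_sentences):
    """Keep the first occurrence of each normalised sentence.

    `labelled_sentences` is an iterable of (text, label) pairs.
    """
    seen = set()
    unique = []
    for text, label in labelled_sentences:
        key = normalise(text)
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique
```

Running this once over the whole corpus, before any cross-validation split, is the step that matters; de-duplicating inside each fold would not remove cross-fold leakage.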
QSAR Classification Modeling for Bioactivity of Molecular Structure via SPL-Logsum.
EN: Quantitative structure-activity relationship (QSAR) modelling is an effective 'bridge' for finding a reliable relationship between bioactivity and molecular structure. A QSAR classification model contains a large number of redundant, noisy and irrelevant descriptors. To address this problem, various methods have been proposed for descriptor selection. Generally, they can be grouped into three categories: filters, wrappers, and embedded methods. Regularization is an important embedded technique, which can be used for continuous shrinkage and automatic descriptor selection. In recent years, researchers' interest in applying regularization techniques to descriptor selection has been increasing, for example logistic regression (LR) with an $L_1$ penalty. In this paper, we propose a novel descriptor selection method based on self-paced learning (SPL) with Logsum-penalized LR for predicting the bioactivity of molecular structure. SPL is inspired by the learning process of humans and animals and gradually brings samples into training, from easy samples (smaller losses) to hard samples (bigger losses), and Logsum regularization has capacity to select few meaningful and significant m...
Accelerating Prototype-Based Drug Discovery using Conditional Diversity Networks.
EN: Designing a new drug is a lengthy and expensive process. As the space of potential molecules is very large (10^23-10^60), a common technique during drug discovery is to start from a molecule which already has some of the desired properties. An interdisciplinary team of scientists generates hypotheses about the required changes to the prototype. In this work, we develop an algorithmic unsupervised approach that automatically generates potential drug molecules given a prototype drug. We show that the molecules generated by the system are valid molecules and significantly different from the prototype drug. Out of the compounds generated by the system, we identified 35 FDA-approved drugs. As an example, our system generated Isoniazid - one of the main drugs for Tuberculosis. The system is currently being deployed for use in collaboration with pharmaceutical companies to further analyze the additional generated molecules.
Cavity-controlled ultracold chemistry.
EN: Ultracold ground-state molecules can be formed from ultracold atoms via photoassociation followed by a spontaneous emission process. Typically, the molecular products are distributed over a range of final states. Here, we propose to use an optical cavity with high cooperativity to selectively enhance the population of a pre-determined final state by controlling the spontaneous emission. During this process, a photon will be emitted into the cavity mode. Detection of this photon heralds a single reaction. We discuss the efficiency and the dynamics of cavity-assisted molecule formation in the frame of realistic parameters that can be achieved in current ultracold-atom setups. In particular, we consider the production of Rb$_2$ molecules in the $a^3\Sigma_u$ triplet ground state. Moreover, when working with more than two atoms in the cavity, collective enhancement effects in chemistry should be observable.
Visualizing Convolutional Neural Network Protein-Ligand Scoring.
EN: Protein-ligand scoring is an important step in a structure-based drug design pipeline. Selecting a correct binding pose and predicting the binding affinity of a protein-ligand complex enables effective virtual screening. Machine learning techniques can make use of the increasing amounts of structural data that are becoming publicly available. Convolutional neural network (CNN) scoring functions in particular have shown promise in pose selection and affinity prediction for protein-ligand complexes. Neural networks are known for being difficult to interpret. Understanding the decisions of a particular network can help tune parameters and training data to maximize performance. Visualization of neural networks helps decompose complex scoring functions into pictures that are more easily parsed by humans. Here we present three methods for visualizing how individual protein-ligand complexes are interpreted by 3D convolutional neural networks. We also present a visualization of the convolutional filters and their weights. We describe how the intuition provided by these visualizations aids in network design.
Polariton Chemistry: controlling molecular dynamics with optical cavities.
EN: Molecular polaritons are the optical excitations which emerge when molecular transitions interact strongly with confined electromagnetic fields. Increasing interest in the hybrid molecular-photonic materials that host these excitations stems from recent observations of their novel and tunable chemistry. Some of the remarkable functionalities exhibited by polaritons include the ability to induce long-range excitation energy transfer, enhance charge conductivity, and inhibit or enhance chemical reactions. In this review, we explain the effective theories of molecular polaritons which form a basis for the interpretation and guidance of experiments at the strong coupling limit. The theoretical discussion is illustrated with the analysis of innovative applications of strongly coupled molecular-photonic systems to chemical phenomena of fundamental importance to future technologies.
Lamellar ordering, droplet formation and phase inversion in exotic active emulsions.
EN: We study numerically the behaviour of a mixture of a passive isotropic fluid and an active polar gel, in the presence of a surfactant favouring emulsification. Focussing on parameters for which the underlying free energy favours the lamellar phase in the passive limit, we show that the interplay between nonequilibrium and thermodynamic forces creates a range of multifarious exotic emulsions. When the active component is contractile (e.g., an actomyosin solution), moderate activity enhances the efficiency of lamellar ordering, whereas strong activity favours the creation of passive droplets within an active matrix. For extensile activity (occurring, e.g., in microtubule-motor suspensions), instead, we observe an emulsion of spontaneously rotating droplets of different size. By tuning the overall composition, we can create high internal phase emulsions, which undergo sudden phase inversion when activity is switched off. Therefore, we find that activity provides a single control parameter to design composite materials with a strikingly rich range of morphologies.
Supervised classification of Dermatological diseases by Deep learning.
EN: This paper introduces an efficient deep-learning-based classifier for common dermatological conditions, aimed at people without easy access to skin specialists. We report approximately 80% accuracy, in a situation where primary care doctors have attained a 57% success rate, according to recent literature. The rationale of its design is centered on deploying and updating it on handheld devices in the near future. Dermatological diseases are common in every population and have a wide spectrum of severity. With a shortage of dermatological expertise being observed in several countries, machine learning solutions can augment medical services and advise regarding the existence of common diseases. The paper implements supervised classification of nine distinct conditions which have high occurrence in East Asian countries. Our current attempt establishes that deep-learning-based techniques are viable avenues for preliminary information to aid patients.
Fabrication and characterization of pH responsive nanoprobes based on ion current rectification.
EN: In this study, we investigated the ionic current rectification of glass nanopipettes modified with bovine serum albumin - glutaraldehyde (BSA-GA) artificial membrane using solutions with various pHs. Ionic current rectification is a phenomenon that is observed with nanopores as asymmetric I-V curves, where the ionic currents recorded through a nanopore differ at the same magnitude of applied electrical potentials biased with opposite polarities. The results clearly showed that modifying the tip of a nanopipette results in a pH dependent ionic current behavior. The proposed strategy is a facile method for fabrication of a pH responsive nanoprobe that has a potential for intracellular pH measurement.
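Illustrative sketch (not from the paper): the asymmetry of an I-V curve is often summarised by a rectification ratio, the ratio of current magnitudes at equal and opposite applied potentials. The convention used below, |I(-V)| / |I(+V)|, the dictionary input format, and the name `rectification_ratio` are assumptions; conventions differ between groups.

```python
def rectification_ratio(iv_curve, probe_voltage):
    """Return |I(-V)| / |I(+V)| for a measured I-V curve.

    `iv_curve` maps applied potential to recorded ionic current and must
    contain both +probe_voltage and -probe_voltage as keys.
    """
    i_pos = iv_curve[probe_voltage]
    i_neg = iv_curve[-probe_voltage]
    if i_pos == 0:
        raise ValueError("current at +V is zero; ratio is undefined")
    return abs(i_neg) / abs(i_pos)
```

A ratio of 1 means a symmetric (ohmic) response; tracking how the ratio changes across solutions of different pH would be one way to quantify the pH dependence described above.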
A novel methodology on distributed representations of proteins using their interacting ligands.
EN: The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins suggesting that a ligand based approach can be utilized in protein representation. In this study, we propose SMILESVec, a SMILES-based method to represent ligands and a novel method to compute similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word-embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in protein clustering task using TransClust and MCL algorithms. Two other protein representation methods that utilize protein sequence, BLAST and ProtVec, and two compound fingerprint based protein representation methods are compared. We showed that ligand-based protein representation, which uses only SMILES strings of the ligands that proteins bind to, performs as well as protein-sequence based representation methods in protein clustering. The results suggest that ligand-based ...
Automatic construction of Chinese herbal prescription from tongue image via CNNs and auxiliary latent therapy topics.
EN: The tongue image provides important physical information about a person and is of great importance for diagnosis and treatment in clinical medicine. Herbal prescriptions are simple, noninvasive and have few side effects, so they are widely applied in China. Studying the automatic construction of herbal prescriptions from tongue images is significant because it lets deep learning explore the relevance of tongue images to herbal prescriptions, and it can be applied to healthcare services in mobile medical systems. To adapt to tongue images taken in a variety of photographic environments and to construct herbal prescriptions, a neural network framework for prescription construction is designed. It includes single/double convolution channels and fully connected layers. Furthermore, it proposes an auxiliary therapy topic loss mechanism to model the therapy of Chinese doctors and alleviate the interference of sparse output labels on the diversity of results. The experiments use real-world tongue images and the corresponding prescriptions, and the results show that the framework can generate prescriptions that are close to the real samples, which verifies the feasibility of the proposed method...
Multi-Objective De Novo Drug Design with Conditional Graph Generative Model.
EN: Recently, deep generative models have emerged as a promising way of performing de novo molecule design. However, previous research has focused mainly on generating SMILES strings instead of molecular graphs. Although graph generative models are available, they are often too general and computationally expensive, which restricts their application to small molecules. In this work, a new de novo molecular design framework is proposed based on a type of sequential graph generator that does not use atom-level recurrent units. Compared with previous graph generative models, the proposed method is much more tuned for molecule generation and has been scaled up to cover significantly larger molecules in the ChEMBL database. It is shown that the graph-based model outperforms SMILES-based models in a variety of metrics, especially in the rate of valid outputs. For drug design tasks, a conditional graph generative model is employed. This method offers higher flexibility than the previous fine-tuning-based approach and is suitable for generation based on multiple objectives. This approach is applied to solve several drug design problems, including the...
In silico generation of novel, drug-like chemical matter using the LSTM neural network.
EN: The exploration of novel chemical spaces is one of the most important tasks of cheminformatics when supporting the drug discovery process. Properly designed and trained deep neural networks can provide a viable alternative to brute-force de novo approaches or various other machine-learning techniques for generating novel drug-like molecules. In this article we present a method to generate molecules using a long short-term memory (LSTM) neural network and provide an analysis of the results, including a virtual screening test. Using the network one million drug-like molecules were generated in 2 hours. The molecules are novel, diverse (contain numerous novel chemotypes), have good physicochemical properties and have good synthetic accessibility, even though these qualities were not specific constraints. Although novel, their structural features and functional groups remain closely within the drug-like space defined by the bioactive molecules from ChEMBL. Virtual screening using the profile QSAR approach confirms that the potential of these novel molecules to show bioactivity is comparable to the ChEMBL set from which they were derived. The molecule generator written in Python used in...
Decoupled molecules with binding polynomials of bidegree (n,2).
EN: We present a result on the number of decoupled molecules for systems binding two different types of ligands. In the case of $n$ and $2$ binding sites respectively, we show that, generically, there are $2(n!)^{2}$ decoupled molecules with the same binding polynomial. For molecules with more binding sites for the second ligand, we provide computational results.
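Illustration (not from the paper): the count stated above, $2(n!)^{2}$, grows quickly with the number of binding sites $n$ for the first ligand. A minimal sketch that only evaluates the formula; the function name is hypothetical.

```python
from math import factorial


def decoupled_molecule_count(n):
    """Evaluate 2 * (n!)^2, the generic count quoted in the abstract
    for n binding sites of one ligand and 2 of the other."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 2 * factorial(n) ** 2


# Tabulate the count for the first few values of n.
counts = {n: decoupled_molecule_count(n) for n in range(1, 5)}
```

For n = 1, 2, 3, 4 this gives 2, 8, 72 and 1152.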
Diamondoid Molecules.
EN: In this review paper we first introduce the cage nature of diamondoid molecules, the variety of their crystalline lattice structures, the nature of their structural isomers, their stereoisomers, and their other molecular specificities. The natural occurrence of diamondoids in petroleum fluids and how they come to be present in such fluids is introduced. Field experiences of phase transitions and depositions, as well as techniques for separation, detection and measurement of diamondoids from petroleum fluids, are presented and discussed. It is demonstrated that, due to their six or more linking groups, diamondoids have found major applications as templates and as molecular building blocks in polymer synthesis, nanotechnology, drug delivery, drug targeting, DNA directed assembly, DNA-amino acid nanostructure formation and in host-guest chemistry.
Regularization approaches for support vector machines with applications to biomedical data.
EN: The support vector machine (SVM) is a widely used machine learning tool for classification based on statistical learning theory. Given a set of training data, the SVM finds a hyperplane that separates two different classes of data points by the largest distance. While the standard form of SVM uses L2-norm regularization, other regularization approaches are particularly attractive for biomedical datasets where, for example, sparsity and interpretability of the classifier's coefficient values are highly desired features. Therefore, in this paper we consider different types of regularization approaches for SVMs, and explore them in both synthetic and real biomedical datasets.
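Illustration (not from the paper): the difference between L2-norm and L1-norm regularization can be seen in the soft-margin objective, a penalty on the weights plus C times the summed hinge loss. The sketch below assumes this textbook form; the function names, the toy data and the value of C are hypothetical, and the paper may consider other regularizers as well.

```python
def hinge_loss(w, b, X, y):
    """Sum of max(0, 1 - y_i * (w . x_i + b)) over all samples."""
    total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total += max(0.0, 1.0 - yi * score)
    return total


def penalty(w, norm="l2"):
    """L2: 0.5 * ||w||_2^2 (standard SVM). L1: ||w||_1 (favors sparse w)."""
    if norm == "l2":
        return 0.5 * sum(wj * wj for wj in w)
    if norm == "l1":
        return sum(abs(wj) for wj in w)
    raise ValueError("norm must be 'l1' or 'l2'")


def svm_objective(w, b, X, y, C=1.0, norm="l2"):
    """Regularized soft-margin objective: penalty(w) + C * hinge loss."""
    return penalty(w, norm) + C * hinge_loss(w, b, X, y)


# Toy data: two points separated along the first feature only.
X = [[2.0, 0.0], [-2.0, 0.0]]
y = [1, -1]
l2_value = svm_objective([1.0, 0.0], 0.0, X, y, norm="l2")
l1_value = svm_objective([1.0, 0.0], 0.0, X, y, norm="l1")
```

Both points here have margin 2, so the hinge loss is zero and the two objectives differ only in the penalty term.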
Dynamics of vaccination in a time-delayed epidemic model with awareness.
EN: This paper investigates the effects of vaccination on the dynamics of an infectious disease that is spreading in a population concurrently with awareness. The model considers contributions to the overall awareness from a global information campaign, direct contacts between unaware and aware individuals, and reported cases of infection. It is assumed that there is some time delay between individuals becoming aware and modifying their behaviour. Vaccination is administered to newborns, as well as to aware individuals, and it is further assumed that vaccine-induced immunity may wane with time. Feasibility and stability of the disease-free and endemic equilibria are studied analytically, and conditions for the Hopf bifurcation of the endemic steady state are found in terms of system parameters and the time delay. Analytical results are supported by numerical continuation of the Hopf bifurcation and numerical simulations of the model to illustrate different types of dynamical behaviour.
Five-dimensional imaging of freezing emulsions with solute effects.
EN: The interaction of objects with a moving solidification front is a common feature of many industrial and natural processes such as metal processing, the growth of single-crystals, the cryopreservation of cells, or the formation of sea ice. Solidification fronts interact with objects with different outcomes, from the total rejection to their complete engulfment. We image the freezing of emulsions in 5D (space, time, and solute concentration) with confocal microscopy. We show the solute induces long-range interactions that determine the solidification microstructure. The local increase of solute concentration enhances premelting, which controls the engulfment of droplets by the front and the evolution of grain boundaries. Freezing emulsions may be a good analogue of many solidification systems where objects interact with a solidification interface.
SUBIC: A Supervised Bi-Clustering Approach for Precision Medicine.
EN: Traditional medicine typically applies one-size-fits-all treatment to the entire patient population, whereas precision medicine develops tailored treatment schemes for different patient subgroups. The fact that some factors may be more significant for a specific patient subgroup motivates clinicians and medical researchers to develop new approaches to subgroup detection and analysis, which is an effective strategy to personalize treatment. In this study, we propose a novel patient subgroup detection method, called Supervised Biclustering (SUBIC), using convex optimization, and apply our approach to detect patient subgroups and prioritize risk factors for hypertension (HTN) in a vulnerable demographic subgroup (African-American). Our approach not only finds patient subgroups with the guidance of a clinically relevant target variable but also identifies and prioritizes risk factors by pursuing sparsity of the input variables and encouraging similarity among the input variables and between the input and target variables.
Random Forests of Interaction Trees for Estimating Individualized Treatment Effects in Randomized Trials.
EN: Assessing heterogeneous treatment effects has become a growing interest in advancing precision medicine. Individualized treatment effects (ITE) play a critical role in such an endeavor. Concerning experimental data collected from randomized trials, we put forward a method, termed random forests of interaction trees (RFIT), for estimating ITE on the basis of interaction trees (Su et al., 2009). To this end, we first propose a smooth sigmoid surrogate (SSS) method, as an alternative to greedy search, to speed up tree construction. RFIT outperforms the traditional 'separate regression' approach in estimating ITE. Furthermore, standard errors for the estimated ITE via RFIT can be obtained with the infinitesimal jackknife method. We assess and illustrate the use of RFIT via both simulation and the analysis of data from an acupuncture headache trial.
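Illustration (not from the paper): the 'separate regression' baseline is sketched here under the assumption that it means fitting one outcome model per trial arm and taking the difference of their predictions as the ITE estimate. The abstract does not define it, so this reading, the one-covariate least-squares fit, the function names and the toy data are all hypothetical. This is the baseline only, not RFIT.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one covariate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def separate_regression_ite(treated, control):
    """Fit one line per arm; return a function x -> estimated ITE at x."""
    a1, b1 = fit_line(*treated)
    a0, b0 = fit_line(*control)
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)


# Toy noiseless data: treated y = 2 + 3x, control y = 1 + x.
treated = ([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
control = ([0.0, 1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0])
ite = separate_regression_ite(treated, control)
```

With this noiseless toy data the estimated effect is 1 + 2x, so it varies with the covariate, which is the kind of heterogeneity an ITE estimator is meant to capture.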
Meta-QSAR: a large-scale application of meta-learning to drug design and discovery.
EN: We investigate the learning of quantitative structure activity relationships (QSARs) as a case-study of meta-learning. This application area is of the highest societal importance, as it is a key step in the development of new medicines. The standard QSAR learning problem is: given a target (usually a protein) and a set of chemical compounds (small molecules) with associated bioactivities (e.g. inhibition of the target), learn a predictive mapping from molecular representation to activity. Although almost every type of machine learning method has been applied to QSAR learning there is no agreed single best way of learning QSARs, and therefore the problem area is well-suited to meta-learning. We first carried out the most comprehensive ever comparison of machine learning methods for QSAR learning: 18 regression methods, 6 molecular representations, applied to more than 2,700 QSAR problems. (These results have been made publicly available on OpenML and represent a valuable resource for testing novel meta-learning methods.) We then investigated the utility of algorithm selection for QSAR problems. We found that this meta-learning approach outperformed the best individual QSAR learning ...
Water-based peeling of thin hydrophobic films.
EN: Inks of permanent markers and water-proof cosmetics create elastic thin films upon application on a surface. Such adhesive materials are deliberately designed to exhibit water-repellent behavior. Therefore, patterns made up of these inks become resistant to moisture and cannot be cleaned by water after drying. However, we show that sufficiently slow dipping of such elastic films, which are adhered to a substrate, into a bath of pure water allows complete removal of the hydrophobic coatings. Upon dipping, the air-water interface in the bath forms a contact line on the substrate, which exerts a capillary-induced peeling force at the edge of the hydrophobic thin film. We highlight that this capillary peeling process is more effective at lower velocities of the air-liquid interface and lower viscosities. Capillary peeling not only removes such thin films from the substrate but also transfers them flawlessly onto the air-water interface.
Automatic differential analysis of NMR experiments in complex samples.
EN: Liquid state NMR is a powerful tool for the analysis of complex mixtures of unknown molecules. This capacity has been used in many analytical approaches, including metabolomics, identification of active compounds in natural extracts, and characterization of species; such studies require the acquisition of many diverse NMR measurements on series of samples. While acquisition can easily be performed automatically, the number of NMR experiments involved in these studies increases very rapidly, and this data avalanche calls for automatic processing and analysis. We present here a program that allows the autonomous, unsupervised processing of a large corpus of 1D, 2D and DOSY experiments from a series of samples acquired in different conditions. The program provides all the signal processing steps, as well as peak-picking and bucketing of 1D and 2D spectra. The program and its components are fully available. In an experiment mimicking the search for an active species in a natural extract, we use it for the automatic detection of small amounts of artemisin added to a series of plant extracts, and for the generation of the spectral fingerprint of this molecule. This program called Pl...
Locating large flexible ligands on proteins.
EN: Many biologically important ligands of proteins are large, flexible, and often charged molecules that bind to extended regions on the protein surface. It is infeasible or expensive to locate such ligands on proteins with standard methods such as docking or molecular dynamics (MD) simulation. The alternative approach proposed here is the scanning of a spatial and angular grid around the protein with smaller fragments of the large ligand. Energy values for complete grids can be computed efficiently with a well-known Fast Fourier Transform accelerated algorithm and a physically meaningful interaction model. We show that the approach can readily incorporate flexibility of protein and ligand. The energy grids (EGs) resulting from the ligand fragment scans can be transformed into probability distributions, and then directly compared to probability distributions estimated from MD simulations and experimental structural data. We test the approach on a diverse set of complexes between proteins and large, flexible ligands, including a complex of Sonic Hedgehog protein and heparin, three heparin sulfate substrates or non-substrates of an epimerase, a multi-branched supramolecular ligand that ...
Neural Machine Translation between Herbal Prescriptions and Diseases.
EN: The current study applies deep learning to herbalism. Toward this goal, we acquired the de-identified health insurance reimbursements that were claimed in a 10-year period from 2004 to 2013 in the National Health Insurance Database of Taiwan, with a total of 340 million reimbursement records. Two artificial intelligence techniques were applied to the dataset: a residual convolutional neural network multitask classifier and an attention-based recurrent neural network. The former translates from herbal prescriptions to diseases, and the latter from diseases to herbal prescriptions. Analysis of the classification results indicates that herbal prescriptions are specific to anatomy, pathophysiology, sex and age of the patient, and season and year of the prescription. Further analysis identifies temperature and gross domestic product as the meteorological and socioeconomic factors that are associated with herbal prescriptions. Analysis of the neural machine translation results indicates that the recurrent neural network learnt not only the syntax but also the semantics of diseases and herbal prescriptions.
On role of matrix behavior in compressive fracture of bovine cortical bone.
EN: In compressive fracture of dry plexiform bone, we examine the individual roles of overall mean porosity, the connectivity of the porosity network, and the elastic as well as the failure properties of the non-porous matrix, using a random spring network model (RSNM). Porosity network structure is shown to reduce the compressive strength by up to 30%. However, the load-bearing capacity increases with an increase in either of the matrix properties: elastic modulus or failure strain threshold. To validate the porosity-based RSNM against available experimental data, bone-specific failure strain thresholds for the ideal matrix of similar elastic properties were estimated to be within 60% of each other. Further, we observe the avalanche size exponents to be independent of the bone-dependent parameters as well as the structure of the porosity network.
Well-supported phylogenies using largest subsets of core-genes by discrete particle swarm optimization.
EN: The number of complete chloroplastic genomes increases day after day, making it possible to rethink plant phylogeny in the biomolecular era. Given a set of closely related plants sharing on the order of one hundred core chloroplastic genes, this article focuses on how to extract the largest subset of sequences in order to obtain the most supported species tree. Due to the computational complexity, a discrete and distributed Particle Swarm Optimization (DPSO) is proposed. It is finally applied to the core genes of the Rosales order.
On the properties of a single OPLS-UA model curcumin molecule in water, methanol and dimethyl sulfoxide. Molecular dynamics computer simulation results.
EN: The properties of model solutions consisting of a single curcumin molecule as the solute in water, methanol and dimethyl sulfoxide solvents have been studied using molecular dynamics (MD) computer simulations in the isobaric-isothermal ensemble. The united atom OPLS force field (OPLS-UA) model for the curcumin molecule that we proposed recently [J. Mol. Liq., 2016, 223, 707], in combination with the SPC/E water model and OPLS-UA type models for methanol and dimethyl sulfoxide, has been applied. We have described in great detail the changes in the internal structure of the solute molecule induced by the different solvent media. The pair distribution functions between particular fragments of the solute molecule and solvent particles have been analyzed. Statistical features of the hydrogen bonding between different species were explored. Finally, we have obtained the self-diffusion coefficient of the curcumin molecule in the three model solvents.
Community interactions determine role of species in parasite spread amplification: the ecomultiplex network model.
EN: Most zoonoses are multi-host parasites with multiple transmission routes that are usually investigated separately despite their potential interplay. As a unifying framework for modelling parasite spread through different paths of infection, we suggest "ecomultiplex" networks, i.e. multiplex networks representing interacting animal communities with (i) spatial structure and (ii) metabolic scaling. We exploit this ecological framework for testing potential control strategies for $T. cruzi$ spread in two real-world ecosystems. Our investigation highlights two interesting results. Firstly, the ecomultiplex topology can be as efficient as more data-demanding epidemiological measures in identifying which species facilitate parasite spread. Secondly, the interplay between predator-prey and host-parasite interactions leads to a phenomenon of parasite amplification in which top predators facilitate $T. cruzi$ spread, offering a theoretical interpretation of previous empirical findings. Our approach is broadly applicable and could provide novel insights in designing immunisation strategies for pathogens with multiple transmission routes in real-world ecosystems.
Ultracold Molecule Assembly with Photonic Crystals.
EN: Photoassociation (PA) is a powerful technique to synthesize molecules directly and continuously from cold and ultracold atoms into deeply bound molecular states. In free space, however, PA efficiency is constrained by the number of spontaneous decay channels linking the initial excited molecular state to a sea of final (meta)stable rovibronic levels. Here, we propose a novel scheme based on molecules strongly coupled to a guided photonic mode in a photonic crystal waveguide that turns PA into a powerful tool for near-deterministic formation of ultracold molecules in their ground rovibrational level. Our example shows a potential ground state molecule production efficiency $> 90\%$, and a saturation rate $>10^6$ molecules per second. By combining state-of-the-art cold atomic and molecular physics with nanophotonic engineering, our scheme presents a novel experimental package for trapping, cooling, and optical manipulation of ultracold molecules, opening up new possibilities in the direction of ultracold chemistry and quantum information.
Stabilization of multiple emulsions using natural surfactants.
EN: In an emulsion system, the emulsifier is one of the most important substances, as it determines the formation, stability and physicochemical properties of emulsions. In this study, the effects of emulsifier concentration, type of hydrophilic emulsifier, and portion of primary emulsion (by weight) on the stability of W/O/W emulsions were investigated. Microscopy images of W/O/W emulsions indicated that the emulsions prepared with 0.5 gram of sodium caseinate have superior stability over other synthesis conditions. Finally, emulsions were prepared using different types of emulsifier (NaCN, Cremophor, Tween 60). Our results showed that emulsions made from Cremophor and Tween 60, in comparison with sodium caseinate, possess smaller droplet size with enhanced stability.
Comparison of Decision Tree Based Classification Strategies to Detect External Chemical Stimuli from Raw and Filtered Plant Electrical Response.
EN: Plants monitor their surrounding environment and control their physiological functions by producing an electrical response. We recorded electrical signals from different plants by exposing them to Sodium Chloride (NaCl), Ozone (O3) and Sulfuric Acid (H2SO4) under laboratory conditions. After applying pre-processing techniques such as filtering and drift removal, we extracted a few statistical features from the acquired plant electrical signals. Using these features, combined with different classification algorithms, we used a decision-tree-based multi-class classification strategy to identify the three different external chemical stimuli. We here present our exploration to obtain the optimum set of ranked feature and classifier combinations that can separate a particular chemical stimulus from the incoming stream of plant electrical signals. The paper also reports an exhaustive comparison of similar feature-based classification using the filtered and the raw plant signals, the latter containing both the high-frequency stochastic part and the low-frequency trends, as two different cases for feature extraction. The work presented in this paper opens up new possibilities for using pl...
BslA-stabilised emulsion droplets with designed microstructure.
EN: Emulsions are a central component of many modern formulations in food, pharmaceuticals, agrichemicals and personal care products. The droplets in these formulations are limited to being spherical as a consequence of the interfacial tension between the dispersed phase and continuous phase. The ability to control emulsion droplet morphology and stabilise non-spherical droplets would enable the modification of emulsion properties such as stability, substrate binding, delivery rate and rheology. One way of controlling droplet microstructure is to apply an elastic film around the droplet to prevent it from relaxing into a sphere. We have previously shown that BslA, an interfacial protein produced by the bacterial genus Bacillus, forms an elastic film when exposed to an oil- or air-water interface. Here, we highlight BslA's ability to stabilise anisotropic emulsion droplets. First, we show that BslA is capable of arresting dynamic emulsification processes leading to emulsions with variable morphologies depending on the conditions and emulsification technique applied. We then show that frozen emulsion droplets can be manipulated to induce partial coalescence. The structure of the partiall...
iMOLSDOCK : induced-fit docking using mutually orthogonal Latin squares (MOLS).
EN: We have earlier reported the MOLSDOCK technique to perform rigid receptor/flexible ligand docking. The method uses the MOLS method, developed in our laboratory. In this paper we report iMOLSDOCK, the 'flexible receptor' extension we have made to the MOLSDOCK algorithm. iMOLSDOCK uses mutually orthogonal Latin squares (MOLS) to sample the conformation and the docking pose of the ligand and also the flexible residues of the receptor protein. The method then uses a variant of the mean field technique to analyze the sample to arrive at the optimum. We have benchmarked and validated iMOLSDOCK with a dataset of 44 peptide-protein complexes. We have also compared iMOLSDOCK with two other flexible receptor docking tools, GOLD v5.2.1 and AutoDock Vina. The results obtained show that the method works better than these two algorithms, though it consumes more computer time.
Many-molecule reaction triggered by a single photon in polaritonic chemistry.
EN: The second law of photochemistry states that in most cases, no more than one molecule is activated for an excited-state reaction for each photon absorbed by a collection of molecules. In this work, we demonstrate that it is possible to trigger a many-molecule reaction using only one photon by strongly coupling the molecular ensemble to a confined light mode. The collective nature of the resulting hybrid states of the system (the so-called polaritons) leads to the formation of a polaritonic "supermolecule" involving the degrees of freedom of all molecules, opening a reaction path on which all involved molecules undergo a chemical transformation. We theoretically investigate the system conditions for this effect to take place and be enhanced.
Multivariate Multiscale Dispersion Entropy of Biomedical Times Series.
EN: Objective: Due to the non-linearity of numerous biomedical signals, non-linear analysis of multi-channel time series, notably multivariate multiscale entropy (mvMSE), has been extensively used in biomedical signal processing. However, mvMSE has three drawbacks: 1) mvMSE values are either undefined or unreliable for short signals; 2) mvMSE is not fast enough for real-time applications; and 3) the computation of mvMSE for signals with a large number of channels requires the storage of a huge number of elements. Methods: To deal with these problems and improve the stability of mvMSE, we introduce multivariate multiscale dispersion entropy (MDE - mvMDE) as an extension of our recently developed MDE, to quantify the complexity of multivariate time series. Results: We assess mvMDE, in comparison with mvMSE and multivariate multiscale fuzzy entropy (mvMFE), on correlated and uncorrelated multi-channel noise signals, bivariate autoregressive processes, and three biomedical datasets. The results show that mvMDE takes into account dependencies in patterns across both the time and spatial domains. The mvMDE, mvMSE, and mvMFE methods are consistent in that they lead to similar conclusions abou...
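To make the underlying quantity concrete, here is a minimal sketch of single-channel, single-scale dispersion entropy, the building block that the multivariate method extends. It assumes the commonly used formulation (normal-CDF mapping to c classes, embedding dimension m, delay d); the parameter defaults and function name are assumptions, and the multivariate, multiscale extension described in the abstract is not implemented.

```python
import math
from collections import Counter


def dispersion_entropy(x, m=2, c=3, d=1):
    """Normalized dispersion entropy of a 1-D signal (assumed formulation)."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    if sigma == 0:
        return 0.0  # a constant signal produces a single pattern
    # Map samples to (0, 1) with the normal cumulative distribution function.
    y = [0.5 * (1.0 + math.erf((v - mu) / (sigma * math.sqrt(2.0)))) for v in x]
    # Assign each sample to one of c classes, clamped to 1..c.
    z = [min(c, max(1, round(c * yi + 0.5))) for yi in y]
    # Count dispersion patterns of length m with delay d.
    patterns = Counter(
        tuple(z[i + k * d] for k in range(m)) for i in range(n - (m - 1) * d)
    )
    total = sum(patterns.values())
    h = -sum((cnt / total) * math.log(cnt / total) for cnt in patterns.values())
    # Normalize by the maximum entropy over all c**m possible patterns.
    return h / math.log(c ** m)


alternating = [0.0, 1.0] * 50
print(dispersion_entropy(alternating))
```

Because only class labels are counted, no pairwise distance computation between embedded vectors is needed, which is consistent with the speed and short-signal motivation stated in the abstract.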
Neural Message Passing for Quantum Chemistry.
EN: Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation procedure to compute a function of their entire input graph. At this point, the next step is to find a particularly effective variant of this general approach and apply it to chemical prediction benchmarks until we either solve them or reach the limits of the approach. In this paper, we reformulate existing models into a single common framework we call Message Passing Neural Networks (MPNNs) and explore additional novel variations within this framework. Using MPNNs we demonstrate state of the art results on an important molecular property prediction benchmark; these results are strong enough that we believe future work should focus on datasets with larger molecules or more accurate ground truth labels.
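The message passing and aggregation pattern described above can be sketched in a few lines. This is a toy illustration only: the message, update, and readout functions below are fixed, hand-picked choices with no learned parameters, whereas the models in the paper learn these functions; the graph encoding and the name mpnn_forward are assumptions.

```python
import math


def mpnn_forward(node_feats, edges, steps=2):
    """Toy message passing on an undirected graph.

    node_feats: dict mapping node id -> list of floats
    edges: list of (u, v, weight) tuples
    """
    neighbors = {v: [] for v in node_feats}
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))
    h = {v: list(x) for v, x in node_feats.items()}
    dim = len(next(iter(h.values())))
    for _ in range(steps):
        # Message phase: each node sums edge-weighted neighbor states.
        messages = {
            v: [sum(w * h[u][k] for u, w in neighbors[v]) for k in range(dim)]
            for v in h
        }
        # Update phase: combine the old state with the incoming message.
        h = {
            v: [math.tanh(h[v][k] + messages[v][k]) for k in range(dim)]
            for v in h
        }
    # Readout: summing over nodes makes the output independent of node order.
    return [sum(h[v][k] for v in h) for k in range(dim)]


# Water-like toy graph: one heavy atom bonded to two identical light atoms.
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 1.0]}
bonds = [(0, 1, 1.0), (0, 2, 1.0)]
print(mpnn_forward(feats, bonds))
```

The sum-based readout is what gives invariance to atom relabeling, which is the molecular symmetry the abstract refers to.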
On the Unreported-Profile-is-Negative Assumption for Predictive Cheminformatics.
EN: In cheminformatics, compound-target binding profiles have been a main source of data for research. For data repositories that only provide positive profiles, a popular assumption is that unreported profiles are all negative. In this paper, we caution readers not to take this assumption for granted, and present empirical evidence of its ineffectiveness from a machine learning perspective. Our examination is based on a setting where binding profiles are used as features to train predictive models; we show that (1) prediction performance degrades when the assumption fails and (2) explicit recovery of unreported profiles improves prediction performance. In particular, we propose a framework that jointly recovers profiles and learns a predictive model, and show that it achieves further performance improvement. The presented study not only suggests applying matrix recovery methods to recover unreported profiles, but also initiates a new missing feature problem which we call Learning with Positive and Unknown Features.
Rigidity strengthening is a vital mechanism for protein-ligand binding.
EN: Protein-ligand binding is essential to almost all life processes. The understanding of protein-ligand interactions is fundamentally important to rational drug design and protein design. Based on large scale data sets, we show that protein rigidity strengthening or flexibility reduction is a pivotal mechanism in protein-ligand binding. Our approach based solely on rigidity is able to unveil a surprisingly long range contribution of four residue layers to protein-ligand binding, which has ramifications for drug and protein design. Additionally, the present work reveals that among various pairwise interactions, the short range ones within the distance of the van der Waals diameter are most important. It is found that the present approach outperforms all the other state-of-the-art scoring functions for protein-ligand binding affinity predictions on two benchmark data sets.
Atomic Convolutional Networks for Predicting Protein-Ligand Binding Affinity.
EN: Empirical scoring functions based on either molecular force fields or cheminformatics descriptors are widely used, in conjunction with molecular docking, during the early stages of drug discovery to predict potency and binding affinity of a drug-like molecule to a given target. These models require expert-level knowledge of physical chemistry and biology to be encoded as hand-tuned parameters or features rather than allowing the underlying model to select features in a data-driven procedure. Here, we develop a general 3-dimensional spatial convolution operation for learning atomic-level chemical interactions directly from atomic coordinates and demonstrate its application to structure-based bioactivity prediction. The atomic convolutional neural network is trained to predict the experimentally determined binding affinity of a protein-ligand complex by direct calculation of the energy associated with the complex, protein, and ligand given the crystal structure of the binding pose. Non-covalent interactions present in the complex that are absent in the protein-ligand sub-structures are identified and the model learns the interaction strength associated with these features. We test ou...
Mechanics of a granular skin.
EN: Magic Sand, a hydrophobic toy granular material, is widely used in popular science instructions because of its non-intuitive mechanical properties. A detailed study of the failure of an underwater column of magic sand shows that these properties can be traced to a single phenomenon: the system self-generates a cohesive skin that encapsulates the material inside. The skin, which consists of pinned air-water-grain interfaces, shows multi-scale mechanical properties that range from contact-line dynamics at the intra-grain roughness scale, through plastic flow at the grain scale, all the way to sample-scale mechanical responses. With decreasing rigidity of the skin, the failure mode transforms from brittle to ductile (both of which are collective in nature) to a complete disintegration at the single grain scale.
Token-based Function Computation with Memory.
EN: In distributed function computation, each node has an initial value and the goal is to compute a function of these values in a distributed manner. In this paper, we propose a novel token-based approach to compute a wide class of target functions to which we refer as "Token-based function Computation with Memory" (TCM) algorithm. In this approach, node values are attached to tokens and travel across the network. Each pair of travelling tokens would coalesce when they meet, forming a token with a new value as a function of the original token values. In contrast to the Coalescing Random Walk (CRW) algorithm, where token movement is governed by random walk, meeting of tokens in our scheme is accelerated by adopting a novel chasing mechanism. We proved that, compared to the CRW algorithm, the TCM algorithm results in a reduction of time complexity by a factor of at least $\sqrt{n/\log(n)}$ in Erdös-Renyi and complete graphs, and by a factor of $\log(n)/\log(\log(n))$ in torus networks. Simulation results show that there is at least a constant factor improvement in the message complexity of TCM algorithm in all considered topologies. Robustness of the CRW and TCM algorithms in the presen...
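The coalescing idea can be illustrated with a small simulation of the baseline Coalescing Random Walk on a complete graph. This sketch does not implement the chasing mechanism of TCM, whose details are not given in the abstract; the scheduling rule (one randomly chosen token moves per step) and all names are assumptions made for illustration.

```python
import random


def coalescing_random_walk(values, combine, rng):
    """Simulate token coalescence on a complete graph with len(values) nodes.

    Each node starts with one token carrying its value. At every step one
    randomly chosen token moves to a uniformly random other node; if a token
    is already there, the two merge via `combine`.
    Returns (final_value, number_of_moves).
    """
    n = len(values)
    tokens = {node: val for node, val in enumerate(values)}  # node -> token value
    steps = 0
    while len(tokens) > 1:
        src = rng.choice(sorted(tokens))
        dst = rng.choice([v for v in range(n) if v != src])
        moving = tokens.pop(src)
        if dst in tokens:
            tokens[dst] = combine(tokens[dst], moving)  # tokens coalesce
        else:
            tokens[dst] = moving
        steps += 1
    (result,) = tokens.values()
    return result, steps


total, moves = coalescing_random_walk([3, 1, 4, 1, 5], lambda a, b: a + b, random.Random(0))
print(total, moves)
```

Any associative and commutative `combine` (sum, max, min) gives a result that does not depend on the order in which tokens meet; only the number of moves varies from run to run.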
SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules.
EN: Simplified Molecular Input Line Entry System (SMILES) is a single line text representation of a unique molecule. One molecule can however have multiple SMILES strings, which is a reason that canonical SMILES have been defined, which ensures a one to one correspondence between SMILES string and molecule. Here the fact that multiple SMILES represent the same molecule is explored as a technique for data augmentation of a molecular QSAR dataset modeled by a long short term memory (LSTM) cell based neural network. The augmented dataset was 130 times bigger than the original. The network trained with the augmented dataset shows better performance on a test set when compared to a model built with only one canonical SMILES string per molecule. The correlation coefficient R2 on the test set was improved from 0.56 to 0.66 when using SMILES enumeration, and the root mean square error (RMS) likewise fell from 0.62 to 0.55. The technique also works in the prediction phase. By taking the average per molecule of the predictions for the enumerated SMILES a further improvement to a correlation coefficient of 0.68 and a RMS of 0.52 was found.
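The prediction-phase step, averaging the predictions of all enumerated SMILES per molecule and then scoring against the measured values, can be sketched as follows. The molecule identifiers and numbers are made-up placeholders, not data from the paper, and generating the enumerated SMILES themselves (which would need a cheminformatics toolkit) is outside this sketch.

```python
import math
from collections import defaultdict


def average_per_molecule(mol_ids, preds):
    """Average predictions over all enumerated SMILES of the same molecule."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for mol_id, pred in zip(mol_ids, preds):
        sums[mol_id] += pred
        counts[mol_id] += 1
    return {mol_id: sums[mol_id] / counts[mol_id] for mol_id in sums}


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical example: one prediction per enumerated SMILES string.
mol_ids = ["mol_a", "mol_a", "mol_a", "mol_b", "mol_b"]
preds = [1.0, 2.0, 3.0, 4.0, 6.0]
averaged = average_per_molecule(mol_ids, preds)

measured = {"mol_a": 2.0, "mol_b": 4.0}
order = sorted(measured)
print(rmse([measured[m] for m in order], [averaged[m] for m in order]))
```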
Assessment of two hybrid van der Waals density functionals for covalent and non-covalent binding of molecules.
EN: Two hybrid van der Waals density functionals (vdW-DFs) are constructed using 25% Fock exchange with i) the consistent-exchange vdW-DF-cx functional and ii) the vdW-DF2 functional. The ability to describe covalent and non-covalent binding properties of molecules is assessed. For properties related to covalent binding, atomization energies (G2-1 set), molecular reaction energies (G2RC set), as well as ionization energies (G21IP set) are benchmarked against experimental reference values. We find that hybrid-vdW-DF-cx yields results that are rather similar to those of the standard non-empirical hybrid PBE0 [JCP 110, 6158 (1996)]. Hybrid vdW-DF2 follows somewhat different trends, showing on average significantly larger deviations from the reference energies, with a MAD of 14.5 kcal/mol for the G2-1 set. Non-covalent binding properties of molecules are assessed using the S22 benchmark set of non-covalently bonded dimers and the X40 set of dimers of small halogenated molecules, using wavefunction-based quantum chemistry results for references. For the S22 set, hybrid-vdW-DF-cx performs better than standard vdW-DF-cx for the mostly hydrogen-bonded systems. Hybrid-vdW-DF2 offers a sl...
Time-Series Adaptive Estimation of Vaccination Uptake Using Web Search Queries.
EN: Estimating vaccination uptake is an integral part of ensuring public health. It was recently shown that vaccination uptake can be estimated automatically from web data, instead of slowly collected clinical records or population surveys. All prior work in this area assumes that features of vaccination uptake collected from the web are temporally regular. We present the first ever method to remove this assumption from vaccination uptake estimation: our method dynamically adapts to temporal fluctuations in time series web data used to estimate vaccination uptake. We show our method to outperform the state of the art compared to competitive baselines that use not only web data but also curated clinical data. This performance improvement is more pronounced for vaccines whose uptake has been irregular due to negative media attention (HPV-1 and HPV-2), problems in vaccine supply (DiTeKiPol), and targeted at children of 12 years old (whose vaccination is more irregular compared to younger children).
MOLIERE: Automatic Biomedical Hypothesis Generation System.
EN: Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE, in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.
On Consistency of Compressive Spectral Clustering.
EN: Spectral clustering is one of the most popular methods for community detection in graphs. A key step in spectral clustering algorithms is the eigen decomposition of the $n{\times}n$ graph Laplacian matrix to extract its $k$ leading eigenvectors, where $k$ is the desired number of clusters among $n$ objects. This is prohibitively complex to implement for very large datasets. However, it has recently been shown that it is possible to bypass the eigen decomposition by computing an approximate spectral embedding through graph filtering of random signals. In this paper, we analyze the working of spectral clustering performed via graph filtering on the stochastic block model. Specifically, we characterize the effects of sparsity, dimensionality and filter approximation error on the consistency of the algorithm in recovering planted clusters.
Rheological and Physicochemical Studies on Emulsions Formulated with Chitosan Previously Dispersed in Aqueous Solutions of Lactic Acid.
EN: Chitosan, a natural, cationic polysaccharide, may be a hydrocolloid strategic to formulate acidic food products, as it can act as both bio-functional and technofunctional constituent. Typically, acetic acid is used to disperse chitosan in aqueous media, but the use of this acid is limited in food formulations due to its flavor. In this study, chitosan was firstly dispersed (0.1% m/V) in lactic acid aqueous solutions (pH 3.0, 3.5 or 4.0), and then evaluated regarding its thickener and emulsion stabilizer properties. O/W emulsions were prepared and characterized in terms of rheological properties, droplets average diameters and droplets $ζ$-potential. Emulsions containing chitosan were 3 times more viscous than controls without chitosan, and presented storage modulus ($G'$) higher than loss modulus ($G''$). Furthermore, they displayed two different populations of droplets (average diameters of 44 and 365 nm) and positive $ζ$-potential values (+50 mV). Droplets average diameters and $ζ$-potential did not present significant changes ($p$ > 0.05) after storage at 25 $^{\circ}$C during 7 days. This study showed that i) food organic acids other than acetic acid can be used to disperse chi...
Protein-Ligand Scoring with Convolutional Neural Networks.
EN: Computational approaches to drug discovery can reduce the time and cost associated with experimental assays and enable the screening of novel chemotypes. Structure-based drug design methods rely on scoring functions to rank and predict binding affinities and poses. The ever-expanding amount of protein-ligand binding and structural data enables the use of deep machine learning techniques for protein-ligand scoring. We describe convolutional neural network (CNN) scoring functions that take as input a comprehensive 3D representation of a protein-ligand interaction. A CNN scoring function automatically learns the key features of protein-ligand interactions that correlate with binding. We train and optimize our CNN scoring functions to discriminate between correct and incorrect binding poses and known binders and non-binders. We find that our CNN scoring function outperforms the AutoDock Vina scoring function when ranking poses both for pose prediction and virtual screening.
On the Parametric Study of Lubricating Oil Production using an Artificial Neural Network (ANN) Approach.
EN: In this study, an Artificial Neural Network (ANN) approach is utilized to perform a parametric study on the process of extraction of lubricants from heavy petroleum cuts. To train the model, we used field data collected from an industrial plant. Operational conditions of feed and solvent flow rate, Temperature of streams and mixing rate were considered as the input to the model, whereas the flow rate of the main product was considered as the output of the ANN model. A feed-forward Multi-Layer Perceptron Neural Network was successfully applied to capture the relationship between inputs and output parameters.
Laser-coolable polyatomic molecules with heavy nuclei.
EN: Recently, a number of diatomic and polyatomic molecules have been identified as prospective systems for Doppler/Sisyphus cooling. Doppler/Sisyphus cooling makes it possible to decrease the kinetic energy of molecules down to microkelvin temperatures with high efficiency and then capture them in molecular traps, including magneto-optical traps. Trapped molecules can be used for the creation of molecular fountains and/or for performing controlled chemical reactions, high-precision spectral measurements, and a multitude of other applications. Polyatomic molecules with heavy nuclei are of considerable interest for the search for "new physics" beyond the Standard Model and for other applications including cold chemistry, photochemistry, and quantum informatics. Here we draw attention to the radium monohydroxide molecule (RaOH), which on the one hand is amenable to laser cooling and on the other hand provides extensive possibilities for searching for P-odd and P,T-odd effects. At the moment, RaOH is the heaviest polyatomic molecule proposed for direct cooling with lasers.
Generation of silicone poly-HIPES with controlled pore sizes via reactive emulsion stabilization.
EN: Macrocellular silicone polymers are obtained after solidification of the continuous phase of a PDMS (polydimethylsiloxane) emulsion, which contains PEG (polyethylene glycol) drops of sub-millimetric dimensions. Coalescence of the liquid template emulsion is prohibited by a reactive blending approach. We investigate in detail the relationship between the interfacial properties and the emulsion stability, and we use micro- and millifluidic techniques to generate macro-cellular polymers with controlled structural properties over a wider range of cell sizes (0.2-2 mm) and volume fractions of the continuous phase (0.1-40%). This approach could easily be transferred to a wide range of polymeric systems.
Tomographic docking suggests the mechanism of auxin receptor TIR1 selectivity.
EN: We study the binding of plant hormone IAA on its receptor TIR1 introducing a novel computational method that we call tomographic docking and that accounts for interactions occurring along the depth of the binding pocket. Our results suggest that selectivity is related to constraints that potential ligands encounter on their way from the surface of the protein to their final position at the pocket bottom. Tomographic docking helps develop specific hypotheses about ligand binding, distinguishing binders from non-binders, and suggests that binding is a three-step mechanism, consisting of engagement with a niche in the back wall of the pocket, interaction with a molecular filter which allows or precludes further descent of ligands, and binding on the pocket base. Only molecules that are able to descend the pocket and bind at its base allow the co-receptor IAA7 to bind on the complex, thus behaving as active auxins. Analyzing the interactions at different depths, our new method helps in identifying critical residues that constitute preferred future study targets and in the quest for safe and effective herbicides. Also, it has the potential to extend the utility of docking from ligand se...
Drying paint: from micro-scale dynamics to mechanical instabilities.
EN: Charged colloidal dispersions make up the basis of a broad range of industrial and commercial products, from paints to coatings and additives in cosmetics. During drying, an initially liquid dispersion of such particles is slowly concentrated into a solid, displaying a range of mechanical instabilities in response to highly variable internal pressures. Here we summarise the current appreciation of this process by pairing an advection-diffusion model of particle motion with a Poisson-Boltzmann cell model of inter-particle interactions, to predict the concentration gradients around a drying colloidal film. We then test these predictions with osmotic compression experiments on colloidal silica, and small-angle x-ray scattering experiments on silica dispersions drying in Hele-Shaw cells. Finally, we use the details of the microscopic physics at play in these dispersions to explore how two macroscopic mechanical instabilities -- shear-banding and fracture -- can be controlled.
Interaction of Tannin with Bovine Serum Albumin by Fluorescence Spectrometry.
EN: The interaction between tannin and bovine serum albumin (BSA) was examined by fluorescence quenching. The quenching of BSA by tannin was static, and the coupling coefficient was one. The force acting between tannin and BSA was hydrophobic.
Influence of Fe, Ni, and Cu Doping on the Photocatalytic Efficiency of ZnS: Implications for Prebiotic Chemistry.
EN: The mineral sphalerite (ZnS) is a typical constituent at the periphery of submarine hydrothermal deposits on Earth. It has been frequently suggested to have played an important role in the prebiotic chemistry due to its prominent photocatalytic activity. Nevertheless, the need for λ < 344 nm UV radiation, which accounts for a very minor part of the energy range of the incoming solar spectrum, limits the application of this semiconductor. In this paper we employed a simple co-precipitation method for the fabrication of Fe, Ni, and Cu-doped ZnS colloids and investigated their activities in the photocatalyzed reduction of fumaric acid. The results show that the photocatalytic activity of pristine ZnS is almost identical with that of 0.1 atom% Fe-doped ZnS, but decreases by doping 0.1 atom% Ni. However, it can be significantly enhanced by doping Cu because this dopant makes the optical absorption edges of ZnS shift from UV band to longer wavelengths. The optimal doping concentration was found to be 0.3 atom%. Even under λ > 450 nm light irradiation, the photocatalyst Zn1-xCuxS can drive the reduction of fumaric acid to produce succinic acid. Given the existence of this doped semiconduc...
A big-data spatial, temporal and network analysis of bovine tuberculosis between wildlife (badgers) and cattle.
EN: Bovine tuberculosis (TB) poses a serious threat to the agricultural industry in several countries; it involves potential interactions between wildlife and cattle and creates societal problems in terms of human-wildlife conflict. This study addresses network connectedness and the spatial and temporal dynamics of TB between cattle in farms and the European badger (Meles meles) using a large dataset generated by a calibrated agent based model. Results showed that infected network connectedness was lower in badgers than in cattle. The contribution of an infected individual to the mean distance of disease spread over time was considerably lower for badgers than for cattle; badgers mainly spread the disease locally while cattle infected both locally and across longer distances. The majority of badger-induced infections occurred when individual badgers left their home sett, and this was positively correlated with badger population growth rates. Point pattern analysis indicated aggregation in the spatial pattern of TB prevalence in badger setts across all scales. The spatial distribution of farms that were not TB free was aggregated at different scales than the spatial distribution of infe...
Protein-protein docking by generalized Fourier transforms on 5D rotational manifolds.
EN: Energy evaluation using fast Fourier transforms enables sampling billions of putative complex structures and hence revolutionized rigid protein-protein docking. However, in current methods efficient acceleration is achieved only in either the translational or the rotational subspace. Developing an efficient and accurate docking method that expands FFT based sampling to 5 rotational coordinates is an extensively studied but still unsolved problem. The algorithm presented here retains the accuracy of earlier methods but yields at least tenfold speedup. The improvement is due to two innovations. First, the search space is treated as the product manifold $\mathbf{SO(3)\times(SO(3)\setminus S^1)}$, where $\mathbf{SO(3)}$ is the rotation group representing the space of the rotating ligand, and $\mathbf{(SO(3)\setminus S^1)}$ is the space spanned by the two Euler angles that define the orientation of the vector from the center of the fixed receptor toward the center of the ligand. This representation enables the use of efficient FFT methods developed for $\mathbf{SO(3)}$. Second, we select the centers of highly populated clusters of docked structures, rather than the lowest energy conformation...
Binary Particle Swarm Optimization versus Hybrid Genetic Algorithm for Inferring Well Supported Phylogenetic Trees.
EN: The number of completely sequenced chloroplast genomes increases rapidly every day, making it possible to build large-scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, the phylogenetic tree that can be inferred from their core genes is not necessarily well supported, due to the possible occurrence of problematic genes (i.e., homoplasy, incomplete lineage sorting, horizontal gene transfers, etc.) which may blur the phylogenetic signal. However, a trustworthy phylogenetic tree can still be obtained provided the number of such blurring genes is reduced. The problem is thus to determine the largest subset of core genes that produces the best-supported tree. To discard problematic genes and due to the overwhelming number of possible combinations, this article focuses on how to extract the largest subset of sequences in order to obtain the most supported species tree. Due to computational complexity, a Binary Particle Swarm Optimization (BPSO) is proposed in sequential and distributed fashions. Obtained results from both versions of the BPSO are compared with those computed using a hybrid appro...
Deep learning is competing random forest in computational docking.
EN: Computational docking is the core process of computer-aided drug design; it aims at predicting the best orientation and conformation of a small drug molecule when bound to a target large protein receptor. The docking quality is typically measured by a scoring function: a mathematical predictive model that produces a score representing the binding free energy and hence the stability of the resulting complex molecule. We analyze the performance of both learning techniques on the scoring power, the ranking power, docking power, and screening power using the PDBbind 2013 database. For the scoring and ranking powers, the proposed learning scoring functions depend on a wide range of features (energy terms, pharmacophore, intermolecular) that entirely characterize the protein-ligand complexes. For the docking and screening powers, the proposed learning scoring functions depend on the intermolecular features of the RF-Score to utilize a larger number of training complexes. For the scoring power, the DL_RF scoring function achieves Pearson's correlation coefficient between the predicted and experimentally measured binding affinities of 0.799 versus 0.758 of the RF scoring function. For the...
RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism.
EN: Accuracy and interpretability are two dominant features of successful predictive models. Typically, a choice must be made in favor of complex black box models such as recurrent neural networks (RNN) for accuracy versus less accurate but more interpretable traditional models such as logistic regression. This tradeoff poses challenges in medicine where both accuracy and interpretability are important. We addressed this challenge by developing the REverse Time AttentIoN model (RETAIN) for application to Electronic Health Records (EHR) data. RETAIN achieves high accuracy while remaining clinically interpretable and is based on a two-level neural attention model that detects influential past visits and significant clinical variables within those visits (e.g. key diagnoses). RETAIN mimics physician practice by attending to the EHR data in reverse time order so that recent clinical visits are likely to receive higher attention. RETAIN was tested on a large health system EHR dataset with 14 million visits completed by 263K patients over an 8-year period and demonstrated predictive accuracy and computational scalability comparable to state-of-the-art methods such as RNN, and ease of interpre...
Boosting Docking-based Virtual Screening with Deep Learning.
EN: In this work, we propose a deep learning approach to improve docking-based virtual screening. The introduced deep neural network, DeepVS, uses the output of a docking program and learns how to extract relevant features from basic data such as atom and residue types obtained from protein-ligand complexes. Our approach introduces the use of atom and amino acid embeddings and implements an effective way of creating distributed vector representations of protein-ligand complexes by modeling the compound as a set of atom contexts that is further processed by a convolutional layer. One of the main advantages of the proposed method is that it does not require feature engineering. We evaluate DeepVS on the Directory of Useful Decoys (DUD), using the output of two docking programs: AutoDock Vina 1.1.2 and DOCK 6.6. Using a strict evaluation with leave-one-out cross-validation, DeepVS outperforms the docking programs in both AUC ROC and enrichment factor. Moreover, using the output of AutoDock Vina 1.1.2, DeepVS achieves an AUC ROC of 0.81, which, to the best of our knowledge, is the best AUC reported so far for virtual screening using the 40 receptors from DUD.
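The two evaluation metrics named in this abstract, AUC ROC and enrichment factor, can be illustrated with a minimal Python sketch. It is not taken from the DeepVS paper: the function names, the toy scores, and the convention that a higher score means "more likely active" are assumptions made here for illustration, and the enrichment factor uses one common definition (hit rate in the top-ranked fraction divided by the hit rate in the whole library).

```python
def roc_auc(scores, labels):
    """AUC ROC as the probability that a random active outscores a random decoy.

    Assumes higher score = more likely active; ties count as half a win.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


def enrichment_factor(scores, labels, fraction):
    """Hit rate among the top-ranked `fraction` divided by the overall hit rate."""
    n = len(scores)
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("need at least one active")
    n_top = max(1, int(round(fraction * n)))
    # Rank compounds from best to worst score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(y for _, y in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / n)


# Toy data (hypothetical): 3 actives among 10 compounds.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
toy_ef20 = enrichment_factor(toy_scores, toy_labels, 0.2)
```

On the toy data, 20 of the 21 active-decoy pairs are ranked correctly, and both compounds in the top 20% are actives against a 30% base rate.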
Evidence for marginal stability in emulsions.
EN: We report the first measurements of the effect of pressure on vibrational modes in emulsions, which serve as a model for soft frictionless spheres at zero temperature. As a function of the applied pressure, we find that the density of states $D(\omega)$ exhibits a low-frequency cutoff $\omega^*$, which scales linearly with the number of extra contacts per particle dz. Moreover, for $\omega<\omega^*$, $D(\omega)\sim\omega^2/{\omega^*}^2$, a quadratic behavior whose prefactor is larger than what is expected from Debye theory. This surprising result agrees with recent theoretical findings. Finally, the degree of localization of the softest low-frequency modes increases with compression, as shown by the participation ratio as well as their spatial configurations. Overall, our observations show that emulsions are marginally stable and display non-plane-wave modes up to vanishing frequencies.
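The abstract uses the participation ratio to quantify mode localization but does not define it. The sketch below uses one common definition, $P = (\sum_i |e_i|^2)^2 / (N \sum_i |e_i|^4)$, where $e_i$ is the displacement of particle $i$ in the mode; whether the paper uses exactly this normalization is an assumption, and the function name and toy modes are hypothetical.

```python
def participation_ratio(mode):
    """Participation ratio of a vibrational mode.

    `mode` is a list of per-particle displacement vectors (tuples of floats).
    Returns a value near 1 for an extended mode and near 1/N for a mode
    localized on a single particle (assumed definition, see lead-in).
    """
    squared = [sum(c * c for c in vec) for vec in mode]  # |e_i|^2 per particle
    s2 = sum(squared)
    s4 = sum(x * x for x in squared)                     # sum of |e_i|^4
    if s4 == 0:
        raise ValueError("mode has zero amplitude")
    return s2 * s2 / (len(mode) * s4)


# Hypothetical 4-particle modes in 2D.
extended_mode = [(1.0, 0.0)] * 4
localized_mode = [(1.0, 0.0), (0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
pr_extended = participation_ratio(extended_mode)
pr_localized = participation_ratio(localized_mode)
```

Under this definition, a decreasing participation ratio with compression would correspond to the increasing localization the abstract reports.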
Highly flexible protein-peptide docking using CABS-dock.
EN: Protein-peptide molecular docking is a difficult modeling problem. It is even more challenging when significant conformational changes that may occur during the binding process need to be predicted. In this chapter, we demonstrate the capabilities and features of the CABS-dock server for flexible protein-peptide docking. CABS-dock allows highly efficient modeling of full peptide flexibility and significant flexibility of a protein receptor. During CABS-dock docking, the peptide folding and binding process is explicitly simulated and no information about the peptide binding site or its structure is used. This chapter presents a successful use of CABS-dock for docking a potentially therapeutic peptide to a protein target. Moreover, simulation contact maps, a new CABS-dock feature, are described and applied to the docking test case. Finally, a tutorial for running CABS-dock from the command line or command line scripts is provided. The CABS-dock web server is available from http://biocomp.chem.uw.edu.pl/CABSdock/
Flexible protein-peptide docking using CABS-dock with knowledge about the binding site.
EN: Despite considerable efforts, structural prediction of protein-peptide complexes is still a very challenging task, mainly for two reasons: the high flexibility of the peptides and the transient character of their interactions with proteins. Recently we have developed an automated web server, CABS-dock (http://biocomp.chem.uw.edu.pl/CABSdock), which conducts flexible protein-peptide docking without any knowledge about the binding site. Our method allows for full flexibility of the peptide, whereas the flexibility of the receptor is restricted to near-native conformations of the main chain and full flexibility of the side chains. Performance of the CABS-dock server was thoroughly tested on a benchmark of 171 test cases, both bound and unbound. Evaluation of the obtained results showed overall good performance of the method, especially given that no information about the binding site was used. From unsuccessful experiments we learned that the accuracy of docking might be significantly improved if even a little information about the binding site were considered. In fact, in real-life applications a user typically has access to some data indicating the location and/or structure of the binding site....
Towards protein-protein docking with significant structural changes using CABS-dock.
EN: Protein-protein interactions (PPIs) are crucial for understanding the majority of cellular processes. PPIs play an important role in gene transcription regulation, cellular signaling, the molecular basis of immune response and more. Moreover, a disruption of these mechanisms is frequently postulated as a possible cause of diseases such as Alzheimer's or cancer. For many biologically relevant cases the structure of protein-protein complexes remains unknown. Therefore computational techniques, including molecular docking, have become a valuable part of drug discovery pipelines. Unfortunately, none of the widely used protein-protein docking tools is free from serious limitations. Typically, in docking simulations the protein flexibility is either completely neglected or very limited. Additionally, some knowledge of the approximate location and/or the shape of the active site is also required. Such limitations arise mostly from the enormous number of degrees of freedom of protein-protein systems. In this paper, an efficient computational method for protein-protein docking is proposed and initially tested on a single docking case. The proposed method is based on a two-step procedure. In t...
Dynamical and structural signatures of the glass transition in emulsions.
EN: We investigate structural and dynamical properties of moderately polydisperse emulsions across an extended range of droplet volume fractions $φ$, encompassing fluid and glassy states up to jamming. Combining experiments and simulations, we show that when $φ$ approaches the glass transition volume fraction $φ_{g}$, dynamical heterogeneities and amorphous order arise within the emulsion. In particular, we find an increasing number of clusters of particles having five-fold symmetry (i.e. the so-called locally favoured structures, LFS) as $φ$ approaches $φ_{g}$, saturating to a roughly constant value in the glassy regime. However, contrary to previous studies, we do not observe a corresponding growth of medium-range crystalline order; instead, the emergence of LFS is decoupled from the appearance of more ordered regions in our system. We also find that the static correlation lengths associated with the LFS and with the fastest particles can be successfully related to the relaxation time of the system. By contrast, this does not hold for the length associated with the orientational order. Our study reveals the existence of a link between dynamics and structure close to the glass ...
Parasite Spreading in Spatial Ecological Multiplex Networks.
EN: Network ecology is a rising field of quantitative biology representing ecosystems as complex networks. A suitable example is parasite spreading: several parasites may be transmitted among their hosts through different mechanisms, each one giving rise to a network of interactions. Modelling these networked ecological interactions at the same time is still an open challenge. We present a novel spatially-embedded multiplex network framework for modelling multi-host infection spreading through multiple routes of transmission. Our model is inspired by T. cruzi, a parasite transmitted by trophic and vectorial mechanisms. Our ecological network model is represented by a multiplex in which nodes represent species populations interacting through a food web and a parasite contaminative layer at the same time. We modelled SI dynamics in two different scenarios: a simple theoretical food web and an empirical one. Our simulations in both scenarios show that the infection is more widespread when both the trophic and the contaminative interactions are considered with equal rates. This indicates that trophic and contaminative transmission may have additive effects in real ecosystems. We also f...
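To make the idea of SI (susceptible-infected) dynamics on a two-layer multiplex concrete, here is a minimal Python sketch. It is a heavily simplified illustration, not the authors' model: it ignores the spatial embedding, treats every link as undirected, and uses a hypothetical four-species network; the function name, layer names, and rates are all assumptions.

```python
import random


def simulate_si(layers, rates, seeds, steps, rng):
    """Discrete-time SI spreading on a multiplex network.

    layers: dict mapping a layer name to a list of (u, v) undirected links.
    rates:  dict mapping a layer name to a per-step transmission probability.
    seeds:  initially infected nodes. Infected nodes never recover (SI).
    """
    infected = set(seeds)
    for _ in range(steps):
        newly_infected = set()
        for name, edges in layers.items():
            p = rates[name]
            for u, v in edges:
                # Try transmission in both directions along each link.
                for src, dst in ((u, v), (v, u)):
                    if src in infected and dst not in infected and rng.random() < p:
                        newly_infected.add(dst)
        infected |= newly_infected  # synchronous update at the end of the step
    return infected


# Hypothetical toy multiplex: a trophic layer and a contaminative (vector) layer.
toy_layers = {
    "trophic": [("a", "b"), ("b", "c")],
    "contaminative": [("c", "d")],
}
both_layers = simulate_si(toy_layers, {"trophic": 1.0, "contaminative": 1.0},
                          {"a"}, steps=3, rng=random.Random(0))
trophic_only = simulate_si(toy_layers, {"trophic": 1.0, "contaminative": 0.0},
                           {"a"}, steps=3, rng=random.Random(0))
```

In this toy setup, node "d" is reachable only through the contaminative layer, so switching that layer off leaves it uninfected; this mirrors, in a very reduced form, the abstract's observation that infection is more widespread when both routes are active.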
NMR based Pharmaco-metabolomics: An efficient and agile tool for therapeutic evaluation of Traditional Herbal Medicines.
EN: Traditional Indian (Ayurvedic) and Chinese herbal medicines have been used in the treatment of a variety of diseases for thousands of years because of their natural origin and fewer side effects. However, the safety and efficacy data (including dose and quality parameters) on most of these traditional medicines are far from sufficient to meet the criteria needed to support their world-wide therapeutic use. Also, the mechanistic understanding of most of these herbal medicines is still lacking due to their complex components, which further limits their wider application and acceptance. Metabolomics, a novel approach to reveal altered metabolism (biochemical effects) produced in response to a disease or its therapeutic intervention, has huge potential to assess the pharmacology and toxicology of traditional herbal medicines (THMs). Therefore, it is gradually becoming a mutually complementary technique to genomics, transcriptomics and proteomics for therapeutic evaluation of pharmaceutical products (including THMs); the approach is called pharmaco-metabolomics. The whole paradigm is based on its ability to provide metabolic signatures to confirm the diseased condition and then to us...
Wavelet Analysis in a Canine Model of Gastric Electrical Uncoupling.
EN: Abnormal gastric motility function could be related to gastric electrical uncoupling, the lack of electrical, and respectively mechanical, synchronization in different regions of the stomach. Therefore, non-invasive detection of the onset of gastric electrical uncoupling can be important for diagnosing associated gastric motility disorders. The aim of this study is to provide a wavelet-based analysis of electrogastrograms (EGG, the cutaneous recordings of gastric electrical activity) to detect gastric electrical uncoupling. Eight-channel EGG recordings were acquired from sixteen dogs in basal state and after each of two circular gastric myotomies. These myotomies simulated mild and severe gastric electrical uncoupling, while keeping the separated gastric sections electrophysiologically active by preserving their blood supply. After visual inspection, manually selected 10-minute EGG segments were submitted to wavelet analysis. A quantitative methodology to choose an optimal wavelet was derived. This "matching" wavelet was determined using the Pollen parameterization for 6-tap wavelet filters and error minimization criteria. After a wavelet-based compression, the distortion of the approxi...
