SCI Publications
2025
T. Patel, T.A.J. Ouermi, T. Athawale, C.R. Johnson.
Fast HARDI Uncertainty Quantification and Visualization with Spherical Sampling, In Computer Graphics Forum, Vol. 44, No. 3, pp. 1--12. 2025.
In this paper, we study uncertainty quantification and visualization of orientation distribution functions (ODFs), which correspond to the diffusion profiles of high angular resolution diffusion imaging (HARDI) data. The shape inclusion probability (SIP) function is the state-of-the-art method for capturing the uncertainty of ODF ensembles. The current method of computing the SIP function with a volumetric basis exhibits high computational and memory costs, which can be a bottleneck to integrating uncertainty into HARDI visualization techniques and tools. We propose a novel spherical sampling framework for faster computation of the SIP function with lower memory usage and increased accuracy. In particular, we propose direct extraction of SIP isosurfaces, which represent confidence intervals indicating spatial uncertainty of HARDI glyphs, by performing spherical sampling of ODFs. Our spherical sampling approach requires far fewer samples than the state-of-the-art volume sampling method, thus providing significantly enhanced performance, scalability, and the ability to perform implicit ray tracing. Our experiments demonstrate that the SIP isosurfaces extracted with our spherical sampling approach can achieve up to 8164× speedup, 37282× memory reduction, and 50.2% less SIP isosurface error compared to the classical volume sampling approach. We demonstrate the efficacy of our methods through experiments on synthetic and human-brain HARDI datasets.
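As a rough illustration of the spherical-sampling idea (not the authors' implementation; array names are hypothetical), the SIP value at radius r along a unit direction u can be estimated as the fraction of ensemble ODFs whose glyph radius at u is at least r:

```python
import numpy as np

def sip_on_sphere(odf_ensemble, radii):
    """Estimate the shape inclusion probability (SIP) on spherical samples.

    odf_ensemble : (n_members, n_dirs) array of ODF glyph radii, one row per
                   ensemble member, evaluated on a shared set of unit directions.
    radii        : (n_radii,) array of query radii along each direction.

    Returns a (n_dirs, n_radii) array whose entry (d, r) is the fraction of
    ensemble members whose glyph contains the point radii[r] * direction[d].
    """
    # A point r*u lies inside member i's glyph iff odf_i(u) >= r.
    inside = odf_ensemble[:, :, None] >= radii[None, None, :]
    return inside.mean(axis=0)

# Toy usage: 20 ensemble members, 64 directions, 16 radii.
rng = np.random.default_rng(0)
ens = rng.uniform(0.5, 1.0, size=(20, 64))
sip = sip_on_sphere(ens, np.linspace(0.0, 1.0, 16))
# An isocontour of `sip` (e.g., SIP = 0.95) traces a confidence boundary per direction.
```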
A. C. Peterson, M. R. Requist, J. C. Benna, J. R. Nelson, S. Elhabian, C. de Cesar Netto, T. C. Beals, A. L. Lenz.
Talar Morphology of Charcot-Marie-Tooth Patients With Cavovarus Feet, In Foot & Ankle International, Sage Publications, 2025.
DOI: 10.1177/10711007241309915
PubMed ID: 39937093
P. Ramonetti, M. Floca, K. O'Laughlin, A. Gupta, M. Parashar, I. Altintas.
National Data Platform's Education Hub, Subtitled arXiv:2510.12820v1, 2025.
As demand for AI literacy and data science education grows, there is a critical need for infrastructure that bridges the gap between research data, computational resources, and educational experiences. To address this gap, we developed a first-of-its-kind Education Hub within the National Data Platform. This hub enables seamless connections between collaborative research workspaces, classroom environments, and data challenge settings. Early use cases demonstrate the effectiveness of the platform in supporting complex and resource-intensive educational activities. Ongoing efforts aim to enhance the user experience and expand adoption by educators and learners alike.
A. Roberts, J. Marquez, K.H. NG, K. Mickelson, A. Panta, G. Scorzelli, A. Gooch, P. Cushman, M. Fritts, H. Neog, V. Pascucci, M. Taufer.
The Making of a Community Dark Matter Dataset with the National Science Data Fabric, Subtitled arXiv:2507.13297v1, 2025.
Dark matter is believed to constitute approximately 85% of the universe’s matter, yet its fundamental nature remains elusive. Direct detection experiments, though globally deployed, generate data that is often locked within custom formats and non-reproducible software stacks, limiting interdisciplinary analysis and innovation. This paper presents a collaboration between the National Science Data Fabric (NSDF) and dark matter researchers to improve accessibility, usability, and scientific value of a calibration dataset collected with Cryogenic Dark Matter Search (CDMS) detectors at the University of Minnesota. We describe how NSDF services were used to convert data from a proprietary format into an open, multi-resolution IDX structure; develop a web-based dashboard for easily viewing signals; and release a Python-compatible CLI to support scalable workflows and machine learning applications. These contributions enable broader use of high-value dark matter datasets, lower the barrier to entry for new collaborators, and support reproducible, cross-disciplinary research.
R. Basu Roy, T. Patel, B. Li, S. Samsi, V. Gadepally, D. Tiwari.
GreenMix: Energy-Efficient Serverless Computing via Randomized Sketching on Asymmetric Multi-Cores, In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Association for Computing Machinery, 2025.
ISBN: 9798400714665
DOI: 10.1145/3712285.3759861
GreenMix is motivated by the renewed interest in asymmetric multi-core processors and the emergence of the serverless computing model. Asymmetric multi-cores offer better energy and performance trade-offs by placing different core types on the same die. However, existing serverless scheduling techniques do not leverage these benefits. GreenMix is the first serverless work to reduce energy and serverless keep-alive costs while meeting QoS targets by leveraging asymmetric multi-cores. GreenMix employs randomized sketching, tailored for serverless execution and keep-alive, to perform within 10% of the optimal solution in terms of energy efficiency and keep-alive cost reduction. GreenMix’s effectiveness is demonstrated through evaluations on clusters of ARM big.LITTLE and Intel Alder Lake asymmetric processors. It outperforms competing state-of-the-art schedulers, offering a novel approach for energy-efficient serverless computing.
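The paper's exact sketching scheme is not reproduced here; as a hedged sketch of how a compact randomized sketch can drive core-type placement on a big.LITTLE-style processor, the following hypothetical count-min router tracks per-function invocation counts in sublinear memory and steers hot functions to big cores:

```python
import hashlib

class CountMinSketch:
    """Compact randomized frequency sketch (standard count-min construction)."""
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        h = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        return min(self.table[row][self._index(key, row)] for row in range(self.depth))

def pick_core(function_id, sketch, hot_threshold=100):
    """Route frequently invoked (hot) functions to big cores, others to LITTLE cores."""
    sketch.add(function_id)
    return "big" if sketch.estimate(function_id) >= hot_threshold else "LITTLE"
```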
S. Saha, S. Joshi, R. Whitaker.
ARD-VAE: A Statistical Formulation to Find the Relevant Latent Dimensions of Variational Autoencoders, Subtitled arXiv:2501.10901, 2025.
The variational autoencoder (VAE) is a popular, deep, latent-variable model (DLVM) due to its simple yet effective formulation for modeling the data distribution. Moreover, optimizing the VAE objective function is more manageable than other DLVMs. The bottleneck dimension of the VAE is a crucial design choice, and it has strong ramifications for the model’s performance, such as finding the hidden explanatory factors of a dataset using the representations learned by the VAE. However, the size of the latent dimension of the VAE is often treated as a hyperparameter estimated empirically through trial and error. To this end, we propose a statistical formulation to discover the relevant latent factors required for modeling a dataset. In this work, we use a hierarchical prior in the latent space that estimates the variance of the latent axes using the encoded data, which identifies the relevant latent dimensions. For this, we replace the fixed prior in the VAE objective function with a hierarchical prior, keeping the remainder of the formulation unchanged. We call the proposed method the automatic relevancy detection in the variational autoencoder (ARD-VAE). We demonstrate the efficacy of the ARD-VAE on multiple benchmark datasets in finding the relevant latent dimensions and their effect on different evaluation metrics, such as FID score and disentanglement analysis.
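A minimal sketch of the general ARD idea described above (per-dimension prior variance estimated from the encoded data), not the paper's exact objective; the function name and the moment estimator are assumptions:

```python
import numpy as np

def ard_kl(mu, logvar):
    """KL(q(z|x) || p(z)) with a per-dimension prior variance estimated from the batch.

    mu, logvar : (batch, latent_dim) Gaussian parameters from the encoder.
    The prior variance gamma_d is set to the second moment of the aggregate
    posterior along dimension d; dimensions whose gamma_d collapses toward the
    encoder noise floor carry no data information and can be flagged as irrelevant.
    """
    var = np.exp(logvar)
    gamma = (mu ** 2 + var).mean(axis=0)                     # (latent_dim,)
    kl = 0.5 * (np.log(gamma)[None, :] - logvar
                + (var + mu ** 2) / gamma[None, :] - 1.0)
    return kl.sum(axis=1).mean()                             # scalar ELBO penalty
```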
S. Saha, S. Joshi, R. Whitaker.
Disentanglement Analysis in Deep Latent Variable Models Matching Aggregate Posterior Distributions, Subtitled arXiv:2501.15705, 2025.
Deep latent variable models (DLVMs) are designed to learn meaningful representations in an unsupervised manner, such that the hidden explanatory factors are interpretable by independent latent variables (aka disentanglement). The variational autoencoder (VAE) is a popular DLVM widely studied in disentanglement analysis due to the modeling of the posterior distribution using a factorized Gaussian distribution that encourages the alignment of the latent factors with the latent axes. Several metrics have been proposed recently, assuming that the latent variables explaining the variation in data are aligned with the latent axes (cardinal directions). However, there are other DLVMs, such as the AAE and WAE-MMD (matching the aggregate posterior to the prior), where the latent variables might not be aligned with the latent axes. In this work, we propose a statistical method to evaluate disentanglement for any DLVMs in general. The proposed technique discovers the latent vectors representing the generative factors of a dataset that can be different from the cardinal latent axes. We empirically demonstrate the advantage of the method on two datasets.
S. Saha, R. Whitaker.
AdaSemSeg: An Adaptive Few-shot Semantic Segmentation of Seismic Facies, Subtitled arXiv:2501.16760, 2025.
Automated interpretation of seismic images using deep learning methods is challenging because of the limited availability of training data. Few-shot learning is a suitable learning paradigm in such scenarios due to its ability to adapt to a new task with limited supervision (small training budget). Existing few-shot semantic segmentation (FSSS) methods fix the number of target classes. Therefore, they do not support joint training on multiple datasets varying in the number of classes. In the context of the interpretation of seismic facies, fixing the number of target classes inhibits the generalization capability of a model trained on one facies dataset to another, which is likely to have a different number of facies. To address this shortcoming, we propose a few-shot semantic segmentation method for interpreting seismic facies that can adapt to the varying number of facies across the dataset, dubbed the AdaSemSeg. In general, the backbone network of FSSS methods is initialized with the statistics learned from the ImageNet dataset for better performance. The lack of such a huge annotated dataset for seismic images motivates using a self-supervised algorithm on seismic datasets to initialize the backbone network. We have trained the AdaSemSeg on three public seismic facies datasets with different numbers of facies and evaluated the proposed method on multiple metrics. The performance of the AdaSemSeg on unseen datasets (not used in training) is better than the prototype-based few-shot method and baselines.
A. Sahistan, S. Zellmann, N. Morrical, V. Pascucci, I. Wald.
Multi-Density Woodcock Tracking: Efficient & High-Quality Rendering for Multi-Channel Volumes, In Eurographics Symposium on Parallel Graphics and Visualization, Eurographics, 2025.
Volume rendering techniques for scientific visualization have increasingly transitioned toward Monte Carlo (MC) methods in recent years due to their flexibility and robustness. However, their application in multi-channel visualization remains underexplored. Traditional compositing-based approaches often employ arbitrary color blending functions, which lack a physical basis and can obscure data interpretation. We introduce multi-density Woodcock tracking, a simple and flexible extension of Woodcock tracking for multi-channel volume rendering that leverages the strengths of Monte Carlo methods to generate high-fidelity visuals. Our method offers a physically grounded solution for inter-channel color blending and eliminates the need for arbitrary blending functions. We also propose a unified blending modality by generalizing Woodcock’s distance tracking method, facilitating seamless integration of alternative blending functions from prior works. Through evaluation across diverse datasets, we demonstrate that our approach maintains real-time interactivity while achieving high-quality visuals by accumulating frames over time.
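A single-ray sketch of delta (Woodcock) tracking generalized to multiple density channels, in the spirit of the approach described above but not the paper's exact estimator; `sample_channels` and the constant majorant are assumptions:

```python
import numpy as np

def multi_channel_woodcock(sample_channels, majorant, t_max, rng):
    """Illustrative delta (Woodcock) tracking over multiple density channels.

    sample_channels(t) -> array of per-channel extinction values at distance t.
    majorant : upper bound on the *sum* of channel extinctions along the ray.
    Returns (t, channel) for the first real collision, or (None, None) if the
    ray leaves the volume.
    """
    t = 0.0
    while True:
        t -= np.log(1.0 - rng.random()) / majorant        # tentative free flight
        if t >= t_max:
            return None, None                              # escaped the volume
        sigma = sample_channels(t)                         # per-channel extinction
        total = sigma.sum()
        if rng.random() < total / majorant:                # real vs. null collision
            channel = rng.choice(len(sigma), p=sigma / total)
            return t, channel                              # shade with this channel
```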
S.A. Sakin, K.E. Isaacs.
Managing Data for Scalable and Interactive Event Sequence Visualization, Subtitled arXiv:2508.03974, 2025.
Parallel event sequences, such as those collected in program execution traces and automated manufacturing pipelines, are typically visualized as interactive parallel timelines. As the dataset size grows, these charts frequently experience lag during common interactions such as zooming, panning, and filtering. Summarization approaches can improve interaction performance, but at the cost of accuracy in representation. To address this challenge, we introduce ESeMan (Event Sequence Manager), an event sequence management system designed to support interactive rendering of timeline visualizations with tunable accuracy. ESeMan employs hierarchical data structures and intelligent caching to provide visualizations with only the data necessary to generate accurate summarizations with significantly reduced data fetch time. We evaluate ESeMan's query times against summed area tables, M4 aggregation, and statistical sub-sampling on a variety of program execution traces. Our results demonstrate ESeMan provides better performance, achieving sub-100ms fetch times while maintaining visualization accuracy at the pixel level. We further present our benchmarking harness, enabling future performance evaluations for event sequence visualization.
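ESeMan's data structures are not reproduced here; the following toy level-of-detail index illustrates the general principle of answering a zoom query from a pre-aggregated level sized to the available pixels (all names are hypothetical):

```python
import bisect

class EventLevels:
    """Toy multi-level index over sorted event start times (not ESeMan itself).

    Level k stores every 2**k-th event, so a zoomed-out query can be answered
    from a coarser level whose size fits the pixels available for drawing.
    """
    def __init__(self, starts, max_level=12):
        starts = sorted(starts)
        self.levels = [starts[::2 ** k] for k in range(max_level) if starts[::2 ** k]]

    def query(self, t0, t1, pixel_budget):
        """Return the finest level whose events in [t0, t1] fit the pixel budget."""
        for level in self.levels:                  # level 0 is the full data
            lo = bisect.bisect_left(level, t0)
            hi = bisect.bisect_right(level, t1)
            if hi - lo <= pixel_budget:
                return level[lo:hi]
        coarsest = self.levels[-1]
        lo = bisect.bisect_left(coarsest, t0)
        return coarsest[lo:lo + pixel_budget]      # hard cap as a last resort
```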
A. Salinas, I. Sohail, V. Pascucci, P. Stefanakis, S. Amjad, A. Panta, R. Schigas, T. Chun-Yiu Chui, N. Duboc, M. Farrokhabadi, R. Stull.
Climate Data for Power Systems Applications: Lessons in Reusing Wildfire Smoke Data for Solar PV Studies, Subtitled arXiv:2509.09888v2, 2025.
Data reuse is the use of data for a purpose distinct from its original intent. As data sharing becomes more prevalent in science, enabling effective data reuse is increasingly important. In this paper, we present a power systems case study of data repurposing for enabling data reuse. We define data repurposing as the process of transforming data to fit a new research purpose. In our case study, we repurpose a geospatial wildfire smoke forecast dataset into a historical dataset. We analyze its efficacy for studying wildfire smoke impact on solar photovoltaic energy production. We also provide documentation and interactive demos for using the repurposed dataset. We identify key enablers of data reuse, including metadata standardization, contextual documentation, and communication between data creators and reusers. We also identify obstacles to data reuse, such as risk of misinterpretation and barriers to efficient data access. Through an iterative approach to data repurposing, we demonstrate how leveraging and expanding knowledge transfer infrastructures like online documentation, interactive visualizations, and data streaming directly addresses these obstacles. The findings facilitate big data use from other domains for power systems applications and grid resiliency.
A. Samanta, Y. Jiang, R. Stutsman, R.B. Roy.
Not Just Fast, But Also Sustainable: Rethinking Network Routing, In MAIoT '25: Proceedings of the Middleware for Autonomous AIoT Systems in the Computing Continuum, ACM, 2025.
The carbon footprint of networking poses a significant environmental concern. In this paper, we show how transmission latency-focused traditional routing can be sub-optimal in terms of carbon footprint. Based on the dynamic factors that affect a network’s carbon footprint and transmission latency, we design a carbon-aware routing solution. We evaluate our solution on real backbone Internet topologies to show that there is an opportunity to minimize the network’s carbon emissions with only modest latency penalties.
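A hedged sketch of carbon-aware path selection, not the paper's routing solution: edge costs combine carbon and latency, and the weights below are illustrative knobs rather than calibrated values:

```python
import heapq

def carbon_aware_route(graph, src, dst, carbon_weight=1.0, latency_weight=0.2):
    """Dijkstra over a composite edge cost trading carbon against latency.

    graph : {node: [(neighbor, carbon_gCO2, latency_ms), ...]}
    Returns (total cost, path) for the cheapest route under the composite cost.
    """
    best = {src: 0.0}
    heap = [(0.0, src, [src])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue                                    # stale queue entry
        for nbr, carbon, latency in graph.get(node, []):
            new_cost = cost + carbon_weight * carbon + latency_weight * latency
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                heapq.heappush(heap, (new_cost, nbr, path + [nbr]))
    return float("inf"), []
```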
A. Samanta, R. Stutsman, R.B. Roy.
GridGreen: Integrating Serverless Computing in HPC Systems for Performance and Sustainability, In SoCC '25: Proceedings of the 2025 ACM Symposium on Cloud Computing, ACM, 2025.
We present GridGreen, a scheduling framework that improves the sustainability and performance of scientific workflow execution by integrating serverless computing with traditional on-premise high performance computing (HPC) clusters. GridGreen allocates workflow components across HPC and serverless environments leveraging spatio-temporal variation of carbon intensity and component execution characteristics. It incorporates component-level optimization, speculative pre-warming, I/O-aware data management, and fallback adaptation to jointly minimize carbon footprint and service time under user-defined cost constraints. Our evaluations on large-scale bioinformatics workflows across leadership-class HPC facilities and cloud-based serverless regions demonstrate that GridGreen achieves robust, cost-effective execution while improving carbon efficiency.
A. Samanta, Y. Jiang, R. Stutsman, R.B. Roy.
Water Footprint of Datacenter Applications: Methodological Implications of Manufacturing, Operational, and Decommissioning Phases, In SoCC '25: Proceedings of the 2025 ACM Symposium on Cloud Computing, ACM, 2025.
Rising computational demands have made cloud datacenters’ water footprint a critical concern. We demonstrate how different water footprint accounting methodologies – incorporating operational, manufacturing, and decommissioning water consumption – impact measurements, and we highlight the need for methodology standardization for water-aware operations. Our analysis reveals opportunities for water-aware scheduling in datacenters by considering regional water variations and lifecycle impacts.
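A minimal sketch of the kind of lifecycle accounting discussed above, with placeholder names and coefficients rather than measured values: operational water scales with energy via a water usage effectiveness (WUE) factor, while manufacturing and decommissioning water are amortized over hardware lifetime:

```python
def lifecycle_water_liters(energy_kwh, wue_l_per_kwh,
                           manufacturing_l, decommissioning_l,
                           lifetime_hours, allocation_hours):
    """Illustrative lifecycle water accounting for a datacenter application.

    Operational water is energy consumption times a WUE factor; manufacturing
    and decommissioning water are amortized over the hardware lifetime and
    charged in proportion to the application's share of usage hours.
    """
    operational = energy_kwh * wue_l_per_kwh
    embodied = (manufacturing_l + decommissioning_l) * (allocation_hours / lifetime_hours)
    return operational + embodied

# Example: 10 kWh at 1.8 L/kWh, amortizing 2000 L of embodied water over 5 years.
total = lifecycle_water_liters(10, 1.8, 1800, 200, 5 * 8760, 24)
```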
J. Schuchart, P. Diehl, M. Bauer, A. Bouteiller, G. Daiss, E. Kayraklioglu, S. Khandekar, T. Herault, J. K. Holmen, R. S. Rao, A. Strack, E. Slaughter, J. C. Spinti, J. N. Thornock, A. Aiken, O. Aumage, M. Berzins, G. Bosilca, B. L. Chamberlain, H. Kaiser, L. Kale.
A Survey of Distributed Asynchronous Many-Task Models and Their Applications, Subtitled Authorea Preprints, 2025.
Asynchronous many-task (AMT) runtime systems have become an important paradigm for expressing fine-grained parallelism and managing asynchrony in high-performance computing (HPC). Originating from early dataflow concepts, AMTs have evolved to enable dynamic task generation, explicit dependency management, and asynchronous execution, facilitating the overlap of computation and communication. These capabilities address the limitations of traditional bulk-synchronous models, such as those employed in MPI+X, which can struggle with irregular, adaptive, or data-driven workloads. This survey provides a comprehensive overview of representative distributed AMT systems—including Charm++, HPX, Legion, PaRSEC, Uintah, Chapel, and StarPU—focusing on their design principles, execution models, and runtime mechanisms for scheduling, communication, and synchronization. We examine how these systems tackle key challenges such as load imbalance, runtime overheads, programmability, and performance portability. In addition, the paper discusses application domains where AMTs have demonstrated tangible benefits and highlights the conditions under which their use is most advantageous. The goal of this survey is to equip researchers and practitioners with a clear understanding of distributed AMT models and to provide guidance for selecting and applying the most suitable runtime system for specific computational objectives.
C. Scully-Allison, K. Williams, S. Brink, O. Pearce, K. Isaacs.
A Tale of Two Models: Understanding Data Workers' Internal and External Representations of Complex Data, Subtitled arXiv:2501.09862v2, 2025.
Data workers may have a different mental model of their data than the one reified in code. Understanding the organization of their data is necessary for analyzing data, be it through scripting, visualization, or abstract thought. More complicated organizations, such as tables with attached hierarchies, may tax people’s ability to think about and interact with data. To better understand and ultimately design for these situations, we conduct a study across a team of ten people working with the same reified data model. Through interviews and sketching, we probed their conception of the data model and developed themes through reflexive data analysis. Participants had diverse data models that differed from the reified data model, even among team members who had designed the model, resulting in parallel hazards limiting their ability to reason about the data. From these observations, we suggest potential design interventions for data analysis processes and tools.
C. Scully-Allison, K. Menear, K. Potter, A. McNutt, K.E. Isaacs, D. Duplyakin.
Same Data, Different Audiences: Using Personas to Scope a Supercomputing Job Queue Visualization, 2025.
Domain-specific visualizations sometimes focus on narrow, albeit important, tasks for one group of users. This focus limits the utility of a visualization to other groups working with the same data. While tasks elicited from other groups can present a design pitfall if not disambiguated, they also present a design opportunity: the development of visualizations that support multiple groups. This development choice presents a trade-off of broadening the scope while limiting support for the narrower tasks of any one group, which in some cases can enhance the overall utility of the visualization. We investigate this scenario through a design study where we develop Guidepost, a notebook-embedded visualization of supercomputer queue data that helps scientists assess supercomputer queue wait times, machine learning researchers understand prediction accuracy, and system maintainers analyze usage trends. We adapt the use of personas for visualization design from existing literature in the HCI and software engineering domains and apply them in categorizing tasks based on their uniqueness across the stakeholder personas. Under this model, tasks shared between all groups should be supported by interactive visualizations, and tasks unique to each group can be deferred to scripting with notebook-embedded visualization design. We evaluate our visualization with nine expert analysts organized into two groups: a "research analyst" group that uses supercomputer queue data in their research (representing the Machine Learning Researcher and Jobs Data Analyst personas) and a "supercomputer user" group that uses this data conditionally (representing the HPC User persona). We find that our visualization serves our three stakeholder groups by enabling users to successfully execute shared tasks with point-and-click interaction while facilitating case-specific programmatic analysis workflows.
S. Sebbio, L. Carnevale, D. Balouek, M. Parashar, M. Villari.
Data-Driven Operational Artificial Intelligence for Computing Continuum: A Natural Disaster Management Use Case, In 2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing Workshops (CCGridW), pp. 92-99. 2025.
The increasing frequency of documented natural disasters can be attributed to advances in communication technologies, such as satellites, the Internet, and smart devices that facilitate better disaster reporting. This is coupled with an actual rise in the occurrence of such events and improved documentation of their impacts. These trends underscore the pressing need for scalable and intelligent technological solutions to efficiently process large datasets, allowing informed decision-making and effective disaster response. This study presents a Computing Continuum framework that integrates intelligence across cloud, edge and deep edge tiers for efficient disaster data processing. A significant characteristic is the incorporation of Artificial Intelligence for IT Operations (AIOps), which leverages machine learning and analytics to facilitate dynamic resource management and adaptive system modeling, thereby addressing the intricate challenges posed by disaster scenarios. The architecture encompasses an AI-driven framework for monitoring and managing service, network, and infrastructure layers, tailoring policies to specific disaster needs. The proposed framework is applied to wildfire management, leveraging an AI Operation Manager to coordinate sensor-equipped drones for real-time data acquisition and processing. Operating at the deep edge tier, these drones transmit environmental data to edge and cloud infrastructures for analysis. This multi-tiered approach improves situational awareness, disaster response, and resource utilization.
M. Shao, S. Joshi.
Domain-Shift Immunity in Deep Deformable Registration via Local Feature Representations, Subtitled arXiv:2512.23142v1, 2025.
Deep learning has advanced deformable image registration, surpassing traditional optimization-based methods in both accuracy and efficiency. However, learning-based models are widely believed to be sensitive to domain shift, with robustness typically pursued through large and diverse training datasets, without explaining the underlying mechanisms. In this work, we show that domain-shift immunity is an inherent property of deep deformable registration models, arising from their reliance on local feature representations rather than global appearance for deformation estimation. To isolate and validate this mechanism, we introduce UniReg, a universal registration framework that decouples feature extraction from deformation estimation using fixed, pre-trained feature extractors and a UNet-based deformation network. Despite training on a single dataset, UniReg exhibits robust cross-domain and multi-modal performance comparable to optimization-based methods. Our analysis further reveals that failures of conventional CNN-based models under modality shift originate from dataset-induced biases in early convolutional layers. These findings identify local feature consistency as the key driver of robustness in learning-based deformable registration and motivate backbone designs that preserve domain-invariant local features.
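A hedged sketch of the decoupled forward pass described above (a frozen feature extractor feeding a deformation network), with hypothetical interfaces and a 2-D displacement field assumed to be in normalized grid coordinates:

```python
import torch
import torch.nn.functional as F

def register(frozen_encoder, deformation_net, moving, fixed):
    """Sketch of a decoupled registration forward pass (assumed interfaces).

    frozen_encoder  : pre-trained feature extractor with frozen weights.
    deformation_net : UNet-style network mapping concatenated features to a
                      2-channel displacement field at image resolution.
    moving, fixed   : (B, 1, H, W) image tensors.
    """
    with torch.no_grad():                                    # features stay fixed
        feats = torch.cat([frozen_encoder(moving), frozen_encoder(fixed)], dim=1)
    flow = deformation_net(feats)                            # (B, 2, H, W), normalized units

    # Build a sampling grid: identity coordinates plus the predicted displacement.
    b, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    identity = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
    grid = identity + flow.permute(0, 2, 3, 1)               # (B, H, W, 2)
    warped = F.grid_sample(moving, grid, align_corners=True)
    return warped, flow
```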
H. Shrestha, J. Wilburn, B. Bollen, A.M. McNutt, A. Lex, L. Harrison.
ReVISitPy: Python Bindings for the reVISit Study Framework, In EuroVis 2025, 2025.
User experiments are an important part of visualization research, yet they remain costly, time-consuming to create, and difficult to prototype and pilot. The process of prototyping a study, from initial design to data collection and analysis, often requires the use of multiple systems (e.g., web servers and databases), adding complexity. We present reVISitPy, a Python library that enables visualization researchers to design studies, pilot deployments, and analyze pilot data entirely within a Jupyter notebook. reVISitPy provides a higher-level Python interface for the reVISit Domain-Specific Language (DSL) and study framework, which traditionally relies on manually authoring complex JSON configuration files. As study configurations grow larger, editing raw JSON becomes increasingly tedious and error-prone. By streamlining the configuration, testing, and preliminary analysis workflows, reVISitPy reduces the overhead of study prototyping and helps researchers quickly iterate on study designs before full deployment through the reVISit framework.
