Luka Borozan

Postdoc

lborozan@mathos.hr

6 (ground floor)

Google Scholar Profile

School of Applied Mathematics and Informatics

Josip Juraj Strossmayer University of Osijek

Research Interests

Computational molecular biology
Discrete and convex optimization
Linear programming

Degrees

PhD in Mathematics, Faculty of Science – Department of Mathematics, University of Zagreb, Croatia, 2021.
MSc in Mathematics and Computer Science, Department of Mathematics, University of Osijek, Croatia, 2015.
BSc in Mathematics, Department of Mathematics, University of Osijek, Croatia, 2013.

Publications

Journal Publications

B. Borozan, T. Prusina, L. Borozan, D. Ševerdija, F. Rojas Ringeling, D. Matijević, S. Canzar, Optimal marker genes for c-separated cell types with SepSolve, Genome Research 35 (2025), 2770-2780

Abstract

The identification of cell types in single-cell RNA-seq studies relies on the distinct expression signature of marker genes. A small set of target genes is also needed to design probes for targeted spatial transcriptomic experiments and to target proteins in single-cell spatial proteomics or for cell sorting. Although traditional approaches have relied on testing one gene at a time for differential expression between a given cell type and the rest, more recent methods have highlighted the benefits of a joint selection of markers that together distinguish all pairs of cell types simultaneously. However, existing methods either consider all pairs of individual cells, which becomes intractable even for medium-sized data sets, or ignore intra-cell-type expression variation entirely by collapsing all cells of a given type to a single representative. Here, we address these limitations and propose to find a small set of genes such that cell types are c-separated in the selected dimensions, a notion introduced previously in learning a mixture of Gaussians. To this end, we formulate a linear program that naturally takes into account expression variation within cell types without including each pair of individual cells in the model, leading to a highly stable set of marker genes that allow to accurately discriminate between cell types and that can be computed to optimality efficiently.
L. Borozan, F. Rojas Ringeling, S. Kao, E. Nikonova, P. Monteagudo, D. Matijević, M.L. Spletter, S. Canzar, Counting pseudoalignments to novel splicing events, Bioinformatics 39/7 (2023)

Abstract

Motivation: Alternative splicing (AS) of introns from pre-mRNA produces diverse sets of transcripts across cell types and tissues, but is also dysregulated in many diseases. Alignment-free computational methods have greatly accelerated the quantification of mRNA transcripts from short RNA-seq reads, but they inherently rely on a catalog of known transcripts and might miss novel, disease-specific splicing events. By contrast, alignment of reads to the genome can effectively identify novel exonic segments and introns. Event-based methods then count how many reads align to predefined features. However, an alignment is more expensive to compute and constitutes a bottleneck in many AS analysis methods.
Results: Here, we propose fortuna, a method that guesses novel combinations of annotated splice sites to create transcript fragments. It then pseudoaligns reads to fragments using kallisto and efficiently derives counts of the most elementary splicing units from kallisto’s equivalence classes. These counts can be directly used for AS analysis or summarized to larger units as used by other widely applied methods. In experiments on synthetic and real data, fortuna was around 7× faster than traditional align and count approaches, and was able to analyze almost 300 million reads in just 15 min when using four threads. It mapped reads containing mismatches more accurately across novel junctions and found more reads supporting aberrant splicing events in patients with autism spectrum disorder than existing methods. We further used fortuna to identify novel, tissue-specific splicing events in Drosophila.
Availability and implementation: fortuna source code is available at https://github.com/canzarlab/fortuna.
D. Ševerdija, T. Prusina, L. Borozan, D. Matijević, Efficient Sentence Representation Learning via Knowledge Distillation with Maximum Coding Rate Reduction, CIT. Journal of Computing and Information Technology 31/4 (2023)

Abstract

Addressing the demand for effective sentence representation in natural language inference problems, this paper explores the utility of pre-trained large language models in computing such representations. Although these models generate high-dimensional sentence embeddings, a noticeable performance disparity arises when they are compared to smaller models. The hardware limitations concerning space and time necessitate the use of smaller, distilled versions of large language models. In this study, we investigate the knowledge distillation of Sentence-BERT, a sentence representation model, by introducing an additional projection layer trained on the novel Maximum Coding Rate Reduction (MCR2) objective designed for general-purpose manifold clustering. Our experiments demonstrate that the distilled language model, with reduced complexity and sentence embedding size, can achieve comparable results on semantic retrieval benchmarks, providing a promising solution for practical applications.
Đ. Borozan, L. Borozan, Analyzing total-factor energy efficiency in Croatian counties: evidence from a non-parametric approach, Central European Journal of Operations Research 26/3 (2018), 673-694

Abstract

Using energy efficiently has become a top-priority concern which requires an adequate policy reaction bearing in mind both energy conservation and efforts to combat adverse climate changes. The paper explored the total-factor energy efficiency and change trends in technical efficiency in the Croatian counties during the period 2001–2013. Employing data envelopment analysis, the overall technical, pure technical and scale efficiency are assessed. Considering the empirical results, we have concluded the following. Technical inefficiency is generated almost equally by the pure technical effect and an incorrect production scale. The overall geographical distribution of the technical efficiency scores points to the presence of spatial concentration, i.e., a dualistic pattern (centre vs. periphery) in the production process. The differences between the best practice and the worst technical efficiency scores indicate the presence of significant disparities among Croatian counties. The years with deteriorating electricity efficiency seem to coincide with the important economic/energy changes that happened in Croatia. Finally, subnational governments may play an important role in energy efficiency policies.
D. Marković, L. Borozan, On Parameter Estimation by Nonlinear Least Squares in Some Special Two-Parameter Exponential Type Models, Applied Mathematics & Information Sciences 9/6 (2015), 2925-2931

Abstract

Two-parameter growth models of exponential type f(t;a,b) = g(t)exp(a+bh(t)), where a and b are unknown parameters and g and h are some known functions, are frequently employed in many different areas such as biology, finance, statistics, medicine, etc. The unknown parameters must be estimated from the data (w_i, t_i, y_i), i = 1,...,n, where t_i denote the values of the independent variable, y_i are the respective estimates of the regression function f and w_i > 0 are some data weights. A very popular and widely used method for parameter estimation is the method of least squares. In practice, to avoid using nonlinear regression, problems of this kind are commonly transformed into linear ones, which is not statistically justified. In this paper we show that for strictly positive g and strictly monotone h the original nonlinear problem has a solution. Generalization in the l_p norm (1 ≤ p < ∞) and some illustrative examples are also given.

Refereed Proceedings

B. Borozan, L. Borozan, D. Ševerdija, F. Rojas Ringeling, D. Matijević, Optimal Marker Genes for c-Separated Cell Types, RECOMB 2025, Seoul, Republic of Korea, 2025

Abstract
The identification of cell types in single-cell RNA-seq studies relies on the distinct expression signature of marker genes. A small set of target genes is also needed to design probes for targeted spatial transcriptomics experiments and to target proteins in single-cell spatial proteomics or for cell sorting. While traditional approaches have relied on testing one gene at a time for differential expression between a given cell type and the rest, more recent methods have highlighted the benefits of a joint selection of markers that together distinguish all pairs of cell types simultaneously. However, existing methods either impose constraints on all pairs of individual cells which becomes intractable even for medium-sized datasets, or ignore intra-cell type expression variation entirely by collapsing all cells of a given type to a single representative. Here we address these limitations and propose to find a small set of genes such that cell types are c-separated in the selected dimensions, a notion introduced previously in learning a mixture of Gaussians. To this end, we formulate a linear program that naturally takes into account expression variation within cell types without including each pair of individual cells in the model, leading to a highly stable set of marker genes that allow to accurately discriminate between cell types and that can be computed to optimality efficiently.
D. Ševerdija, T. Prusina, A. Jovanović, L. Borozan, J. Maltar, D. Matijević, Compressing Sentence Representation with Maximum Coding Rate Reduction (Best paper award in AIS - Artificial Intelligence Systems track), ICT and Electronics Convention (MIPRO), 2023 46th MIPRO, Opatija, Croatia, 2023

Abstract
In most natural language inference problems, sentence representation is needed for semantic retrieval tasks. In recent years, pre-trained large language models have been quite effective for computing such representations. These models produce high-dimensional sentence embeddings. An evident performance gap between large and small models exists in practice. Hence, due to space and time hardware limitations, there is a need to attain comparable results when using the smaller model, which is usually a distilled version of the large language model. In this paper, we assess the model distillation of the sentence representation model Sentence-BERT by augmenting the pre-trained distilled model with a projection layer additionally learned on the Maximum Coding Rate Reduction (MCR2) objective, a novel approach developed for general purpose manifold clustering. We demonstrate that the new language model with reduced complexity and sentence embedding size can achieve comparable results on semantic retrieval benchmarks.
B. Borozan, L. Borozan, D. Ševerdija, D. Matijević, S. Canzar, Fortuna Detects Novel Splicing in Drosophila scRNASeq Data, ICT and Electronics Convention (MIPRO), 2023 46th MIPRO, Opatija, Croatia, 2023, 410-415

Abstract
Recent developments in single-cell RNA sequencing techniques (scRNASeq) have made large quantities of sequenced data available across numerous species and tissues. Alternative splicing (AS) of pre-mRNA introns varies between tissues and even between cell types and can be altered in disease. Novel AS in standard RNASeq data has been studied extensively for many years, while similar work on scRNASeq data has been scarce, despite its potential to offer a broader insight into cell-type-specific processes. In this paper, we propose a novel pipeline that uses fortuna, a method that efficiently classifies and quantifies novel AS events, to process scRNASeq samples. Due to its short lifespan, high number of progeny, low maintenance cost, and intricate alternative splicing patterns similar in complexity to those of mammals, Drosophila melanogaster (fruit fly) is a species of particular interest to researchers. Therefore, we experimentally evaluate our pipeline on real-world Drosophila single-cell data samples from the Fly Cell Atlas.
V. Hoan Do, M. Blažević, P. Monteagudo, L. Borozan, K. Elbassioni, S. Laue, F. Rojas Ringeling, D. Matijević, S. Canzar, Dynamic pseudo-time warping of complex single-cell trajectories, 23rd Annual International Conference on Research in Computational Molecular Biology, The George Washington University, 2019, 294-297

Abstract
Single-cell RNA sequencing enables the construction of trajectories describing the dynamic changes in gene expression underlying biological processes such as cell differentiation and development. The comparison of single-cell trajectories under two distinct conditions can illuminate the differences and similarities between the two and can thus be a powerful tool. Recently developed methods for the comparison of trajectories rely on the concept of dynamic time warping (dtw), which was originally proposed for the comparison of two time series. Consequently, these methods are restricted to simple, linear trajectories. Here, we adopt and theoretically link arboreal matchings to dtw and propose an algorithm to compare complex trajectories that more realistically contain branching points that divert cells into different fates. We implement a suite of exact and heuristic algorithms suitable for the comparison of trajectories of different characteristics in our tool Trajan. Trajan automatically pairs similar biological processes between conditions and aligns them in a globally consistent manner. In an alignment of single-cell trajectories describing human muscle differentiation and myogenic reprogramming, Trajan identifies and aligns the core paths without prior information. From Trajan’s alignment, we are able to reproduce recently reported barriers to reprogramming. In a perturbation experiment, we demonstrate the benefits in terms of robustness and accuracy of our model which compares entire trajectories at once, as opposed to a pairwise application of dtw. Trajan is available at https://github.com/canzarlab/Trajan.
L. Borozan, D. Matijević, S. Canzar, Properties of the generalized Robinson-Foulds metric, 42nd International Convention - MIPRO 2019, Opatija, 2019, 330-335

Abstract
Comparing hierarchical structures is a problem with many applications in various fields of biology. In this work we address the problem of comparing phylogenetic trees and quantifying their dissimilarities. The most commonly applied measure of similarity between phylogenetic trees is the Robinson-Foulds (RF) metric. The Jaccard-Robinson-Foulds (JRF) metric (of order k) has been recently proposed as a generalization of the RF metric that preserves its widely appreciated properties but increases its resolution and robustness. Here, we conduct a thorough experimental analysis of the JRF metric and variations thereof on both real-world and simulated data. Our main aim is to deepen the understanding of the properties of this generalized RF metric in comparison to the classical RF metric and other matching-based distance measures. To compute the JRF distance between trees, we employ the recently proposed branch-and-cut solver Trajan.
Đ. Borozan, L. Borozan, The stationarity of per capita electricity consumption in Croatia allowing for structural break(s), 13th International Symposium on Operational Research, Bled, Slovenia, 2015, 337-342

Abstract
Understanding the stationarity properties of electricity consumption provides valuable insights for energy policy-makers and practitioners. The paper examines the unit root properties of per capita electricity consumption for Croatian counties using the panel unit root tests with structural break(s) during the period 2001-2013. The results indicate that the series of most counties are non-stationary processes, and that statistically significant structural break(s) happened only in a few of them. Hence, the impacts of shocks on per capita electricity consumption are permanent and have a long memory for a majority of them. Moreover, their behaviors are path-dependent.

Others

L. Borozan, D. Matijević, S. Canzar, Combinatorial optimization algorithms for (pseudo)alignment in bioinformatics (2021)

Abstract
The field of bioinformatics is a fast growing interdisciplinary field with a strong contribution from mathematics and computer science. This thesis will deal with mathematical problems and algorithmic challenges from that field. Its first focus will be the comparison of hierarchic structures, mainly phylogenetic trees, which is used to explain various biological processes such as the evolution of the species. We will study mathematical models and algorithmic techniques which quantify the distance between such structures as means of determining the similarities or dissimilarities between them. The focus will be given to formulating the problem based on matching in the context of integer linear programming. Our goal will be to find a novel solution which respects the ancestry relations defined by those hierarchical structures and is often overlooked in the current research. Our main result will be given in a form of a software tool - Trajan, which will be tested on both the real world and simulated data. The second focus of the thesis will come from the problem of sequencing the RNA molecule. It is a combinatorial process of reconstruction of the RNA molecule from short nucleotide sequences which is used to analyze the transcriptome of a biological sample. Many recent studies consider a problem of quantification and classification of unannotated splicing events which often occur due to the mutations caused by abnormal state of the organism, e.g. cancer. We will present another software tool, called fortuna, which brings together high accuracy and fast running times to the analysis of the alternative splicing events unlike any of the well established competitor tools.
D. Matijević, D. Ševerdija, S. Jelić, L. Borozan, Uparena optimizacijska metoda [The coupled optimization method], Math.e : hrvatski matematički elektronski časopis 30 (2016)

Abstract
In this article we analyze the gradient descent and mirror descent methods in convex optimization, with an emphasis on their convergence rates. Furthermore, by coupling the two methods we obtain the so-called coupled method, whose convergence analysis shows a speed-up over the gradient and mirror methods, as well as over any other first-order method known to us.

Projects

Development of an interactive virtual environment (Department of Mathematics, University of Osijek – University of Osijek), project leader: Luka Borozan, 1.5.2021 – 1.5.2025.
Application of optimization methods in biomedicine (Department of Mathematics, University of Osijek – Ministry of Science and Education, Programme of Scientific and Technological Cooperation between the Republic of Croatia and the Republic of Serbia), project leaders: Slobodan Jelić, Dušan Jakovetić, 1.1.2019 – 1.7.2022.
The parameter estimation problem in some two-parameter monotone mathematical models (Department of Mathematics, University of Osijek – University of Osijek), project leader: Darija Marković, 25.9.2013 – 24.9.2014.

Teaching

Functional programming
Modern computer systems
Operating systems
Programming language semantics
Mathematical logic in computer science