Research

I am a machine learning and natural language processing (NLP) researcher. Most of my publications originate from my Ph.D. research, where I investigated large language models (LLMs) for strategic decision-making. Before my Ph.D., I also co-authored papers and one book in related fields. Find a list of my research publications below.

Ph.D. Papers

My Ph.D. research investigates using LLMs for strategic decision-making.

Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs

Felix Fricke, Simon Malberg, and Georg Groh

In: arXiv preprint

Abstract:

“Prompting schemes such as Chain of Thought, Tree of Thoughts, and Graph of Thoughts can significantly enhance the reasoning capabilities of large language models. However, most existing schemes require users to define static, problem-specific reasoning structures that lack adaptability to dynamic or unseen problem types. Additionally, these schemes are often under-optimized in terms of hyperparameters, prompts, runtime, and prompting cost. To address these limitations, we introduce Framework of Thoughts (FoT)—a general-purpose foundation framework for building and optimizing dynamic reasoning schemes. FoT comes with built-in features for hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching, unlocking the latent performance potential of reasoning schemes. We demonstrate FoT’s capabilities by implementing three popular schemes—Tree of Thoughts, Graph of Thoughts, and ProbTree—within FoT. We empirically show that FoT enables significantly faster execution, reduces costs, and achieves better task scores through optimization. We release our codebase to facilitate the development of future dynamic and efficient reasoning schemes.”
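
The chain, tree, and graph schemes FoT builds on share a common skeleton: expand candidate "thoughts", score them, and keep the best. Below is a minimal sketch of that loop, with memoization standing in for FoT's caching; the `expand` and `score` stubs are hypothetical placeholders, not the FoT API.

```python
import heapq
from functools import lru_cache

# Hypothetical stand-in for an LLM call that proposes follow-up thoughts.
# In a real scheme these would be model-generated; here they are stubs.
@lru_cache(maxsize=None)          # caching: repeated states cost nothing
def expand(thought: str) -> tuple:
    return tuple(f"{thought}.{i}" for i in range(3))

@lru_cache(maxsize=None)
def score(thought: str) -> float:
    return -len(thought)          # toy heuristic: prefer shorter derivations

def tree_of_thoughts(root: str, beam_width: int = 2, depth: int = 3) -> str:
    """Beam search over a dynamically expanded thought tree."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for t in frontier for c in expand(t)]
        frontier = heapq.nlargest(beam_width, candidates, key=score)
    return max(frontier, key=score)

best = tree_of_thoughts("q")
```

The hyperparameters FoT optimizes (beam width, depth, prompts) appear here as plain function arguments; parallel execution would replace the inner list comprehension with concurrent calls.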

Citation:

@misc{fricke2026frameworkthoughts,
      title = {Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs}, 
      author = {Fricke, Felix and Malberg, Simon and Groh, Georg},
      year = {2026},
      eprint = {2602.16512},
      archivePrefix = {arXiv},
      primaryClass = {cs.AI},
      url = {https://arxiv.org/abs/2602.16512}, 
}

DIALECTIC: A Multi-Agent System for Startup Evaluation

Jae Yoon Bae, Simon Malberg, Joyce Galang, Andre Retterath, and Georg Groh

In: Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics: Industry Track (EACL 2026)

Abstract:

“Venture capital (VC) investors face a large number of investment opportunities but only invest in few of these, with even fewer ending up successful. Early-stage screening of opportunities is often limited by investor bandwidth, demanding tradeoffs between evaluation diligence and number of opportunities assessed. To ease this tradeoff, we introduce DIALECTIC, an LLM-based multi-agent system for startup evaluation. DIALECTIC first gathers factual knowledge about a startup and organizes these facts into a hierarchical question tree. It then synthesizes the facts into natural-language arguments for and against an investment and iteratively critiques and refines these arguments through a simulated debate, which surfaces only the most convincing arguments. Our system also produces numeric decision scores that allow investors to rank and thus efficiently prioritize opportunities. We evaluate DIALECTIC through backtesting on real investment opportunities aggregated from five VC funds, showing that DIALECTIC matches the precision of human VCs in predicting startup success.”
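
The debate mechanism can be caricatured in a few lines: keep arguments per side, let each critique round discard the weakest, and derive a decision score from what survives. This is a toy sketch with hand-set strengths, not the DIALECTIC pipeline, which uses LLM agents for generation, critique, and judging.

```python
from dataclasses import dataclass

@dataclass
class Argument:
    text: str
    side: str        # "pro" or "con"
    strength: float  # in the real system an LLM judge would assign this

def debate(arguments, rounds=2, keep=2):
    """Toy critique loop: each round drops the weakest argument per side,
    mimicking how a simulated debate surfaces only convincing arguments."""
    for _ in range(rounds):
        for side in ("pro", "con"):
            on_side = sorted((a for a in arguments if a.side == side),
                             key=lambda a: a.strength)
            if len(on_side) > keep:
                arguments.remove(on_side[0])  # weakest argument is critiqued away
    pro = sum(a.strength for a in arguments if a.side == "pro")
    con = sum(a.strength for a in arguments if a.side == "con")
    return arguments, pro / (pro + con)       # decision score in [0, 1]
```

The final ratio plays the role of the numeric score investors can rank opportunities by.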

Citation:

@inproceedings{bae2026dialectic,
    title = {{DIALECTIC}: A Multi-Agent System for Startup Evaluation},
    author = {Bae, Jae Yoon and Malberg, Simon and Galang, Joyce and Retterath, Andre and Groh, Georg},
    editor = {Matusevych, Yevgen and Eryi{\u{g}}it, G{\"u}l{\c{s}}en and Aletras, Nikolaos},
    booktitle = {Proceedings of the 19th Conference of the {E}uropean Chapter of the {A}ssociation for {C}omputational {L}inguistics (Volume 5: Industry Track)},
    month = mar,
    year = {2026},
    address = {Rabat, Morocco},
    publisher = {Association for Computational Linguistics},
    url = {https://aclanthology.org/2026.eacl-industry.53/},
    doi = {10.18653/v1/2026.eacl-industry.53},
    pages = {711--727},
    isbn = {979-8-89176-384-5}
}

From Roots to Rewards: Dynamic Tree Reasoning with Reinforcement Learning

Ahmed Bahloul and Simon Malberg

In: Proceedings of the 2025 IEEE International Conference on Data Mining Workshops (ICDMW). Presented at the 1st Workshop on Reasoning, Agents, Retrieval, and Attribution (RARA)

Abstract:

“Modern language models address complex questions through chain-of-thought (CoT) reasoning and retrieval augmentation, yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) framework, mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge. However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.”
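
The confidence-weighted selection at the heart of ProbTree, and the "expand only when uncertain" policy this paper learns, can be sketched as follows. The helper names are hypothetical; the actual frameworks estimate confidences from model log-probabilities.

```python
def select_answer(candidates):
    """candidates: (answer, confidence, source) triples, e.g. one answer from
    parametric knowledge and one from retrieval; keep the most confident."""
    return max(candidates, key=lambda c: c[1])

def should_expand(confidence, threshold=0.8):
    """Dynamic-tree policy: decompose a question further only when the
    current best answer is uncertain, saving LLM calls on confident nodes."""
    return confidence < threshold

node = [("Paris", 0.92, "parametric"), ("Lyon", 0.31, "retrieval")]
```

Static ProbTree evaluates every strategy at every node; the learned policy replaces the fixed `threshold` with an action-selection model.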

Citation:

@inproceedings{bahloul2025roots,
    author = {Bahloul, Ahmed and Malberg, Simon},
    title = {From Roots to Rewards: Dynamic Tree Reasoning with Reinforcement Learning},
    booktitle = {Proceedings of the 2025 IEEE International Conference on Data Mining Workshops (ICDMW)},
    year = {2025},
    month = nov,
    address = {Washington, DC, USA},
    publisher = {IEEE},
    url = {https://arxiv.org/abs/2507.13142},
    note = {Presented at the 1st Workshop on Reasoning, Agents, Retrieval, and Attribution (RARA)}
}

An LLM-Based Decision Support System for Strategic Decision-Making

Majd Alkayyal, Simon Malberg, and Georg Groh

In: Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track and Demo Track (ECML PKDD 2025)

Abstract:

“We introduce StrategicAI, a decision support system (DSS) for organization leaders and managers responsible for making strategic decisions on the course of their organizations. The main idea behind StrategicAI is to reduce the inherent complexity of strategic decisions using logic trees. These tree structures recursively decompose the involved problem and solution spaces into less-complex parts until these parts become straightforward to answer based on known information. StrategicAI follows a human-AI collaboration philosophy where users are in full control of the tree decompositions applied and can decide flexibly which parts of the trees they create manually and which parts the artificial intelligence (AI) creates. The AI is a multi-agent system based on retrieval-augmented large language models (LLMs). To obtain data-driven insights, StrategicAI actively retrieves facts from user-uploaded files and online sources and incorporates them throughout the created trees.”
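
A logic tree of this kind is just a recursive data structure: a node is resolved either directly by an answer or indirectly once all of its sub-problems are. A minimal sketch (my own illustration, not the StrategicAI codebase):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LogicNode:
    question: str
    answer: Optional[str] = None          # filled by the user or the AI agents
    children: list = field(default_factory=list)

    def decompose(self, subquestions):
        """Recursively split this node into less-complex sub-problems."""
        self.children = [LogicNode(q) for q in subquestions]
        return self.children

    def resolved(self):
        """A node is resolved once it has an answer or all children are."""
        if self.answer is not None:
            return True
        return bool(self.children) and all(c.resolved() for c in self.children)

root = LogicNode("Should we enter market X?")
a, b = root.decompose(["How large is the market?", "Can we serve it profitably?"])
```

The human-AI collaboration aspect maps onto who calls `decompose` and who fills in `answer` at each node.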

Citation:

@inproceedings{alkayyal2025strategicai,
    author = {Alkayyal, Majd and Malberg, Simon and Groh, Georg},
    title = {An {LLM}-Based Decision Support System for Strategic Decision-Making},
    booktitle = {Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track and Demo Track ({ECML PKDD})},
    year = {2025},
    month = oct,
    publisher = {Springer Nature Switzerland},
    address = {Cham},
    pages = {460--464},
    doi = {10.1007/978-3-032-06129-4_31},
    url = {https://link.springer.com/chapter/10.1007/978-3-032-06129-4_31},
    isbn = {978-3-032-06129-4}
}

Bridging AI and Business: Are Large Language Models Good Management Consultants?

Simon Malberg, Nikita Grigorev, Victoria Lea Rein, Barbara Borne Bass, Maria Alejandra Gelvez Alvarez, and Georg Groh

In: Proceedings of the 33rd European Conference on Information Systems (ECIS 2025)

Abstract:

“We present an evaluation of ten state-of-the-art large language models (LLMs) taking the role of management consultants. The assessment is performed in a simulated interview setting where the LLMs are tasked with solving case studies – a popular testing procedure during job interviews at management consulting firms. We compose a dataset of eight annotated case studies with a total of 37 business questions and corresponding reference answers. We record the LLMs’ answers and evaluate their performance in terms of outcome and process validity. Our findings indicate that eight out of the ten evaluated LLMs would pass these case interviews, exceeding the minimum level of performance expected from human candidates. Further, the LLMs seem to take measures to ensure high-quality decisions. These findings highlight the potential of LLMs to serve as strategic advisors to managers in charge of making reliable decisions.”

Citation:

@inproceedings{malberg2025llmconsultants,
    author = {Malberg, Simon and Grigorev, Nikita and Rein, Victoria Lea and Borne Bass, Barbara and Gelvez Alvarez, Maria Alejandra and Groh, Georg},
    title = {Bridging {AI} and Business: Are Large Language Models Good Management Consultants?},
    booktitle = {Proceedings of the 33rd European Conference on Information Systems ({ECIS} 2025)},
    year = {2025},
    month = jun,
    address = {Amman, Jordan},
    publisher = {Association for Information Systems (AIS)},
    url = {https://aisel.aisnet.org/ecis2025/ai_org/ai_org/3},
    note = {Paper 3}
}

From Causal Parrots to Causal Prophets? Towards Sound Causal Reasoning with Large Language Models

Rahul Babu Shrestha, Simon Malberg, and Georg Groh

In: Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities (NLP4DH 2025)

Abstract:

“Causal reasoning is a fundamental property of human and machine intelligence. While large language models (LLMs) excel in many natural language tasks, their ability to infer causal relationships beyond memorized associations is debated. This study systematically evaluates recent LLMs’ causal reasoning across three levels of Pearl’s Ladder of Causation—associational, interventional, and counterfactual—as well as commonsensical, anti-commonsensical, and nonsensical causal structures using the CLadder dataset. We further explore the effectiveness of prompting techniques, including chain of thought (CoT), self-consistency (SC), and causal chain of thought (CausalCoT), in enhancing causal reasoning, and propose two new techniques causal tree of thoughts (CausalToT) and causal program of thoughts (CausalPoT). While larger models tend to outperform smaller ones and are generally more robust against perturbations, our results indicate that all tested LLMs still have difficulties, especially with counterfactual reasoning. However, our CausalToT and CausalPoT significantly improve performance over existing prompting techniques, suggesting that hybrid approaches combining LLMs with formal reasoning frameworks can mitigate these limitations. Our findings contribute to understanding LLMs’ reasoning capacities and outline promising strategies for improving their ability to reason causally as humans would.”
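
The rungs of the ladder are easy to make concrete with a toy structural causal model: observing P(wet) is rung one, while do(sprinkler = on) severs the rain → sprinkler mechanism and asks a rung-two question. A self-contained sketch (my own illustration, not from the paper or the CLadder dataset):

```python
import random

def p_wet(n=10_000, do_sprinkler=None, seed=0):
    """Toy SCM: rain suppresses the sprinkler; rain or sprinkler wets the
    grass.  do_sprinkler overrides the sprinkler's mechanism (the do-operator)."""
    rng = random.Random(seed)
    wet = 0
    for _ in range(n):
        rain = rng.random() < 0.3
        if do_sprinkler is None:
            sprinkler = (not rain) and rng.random() < 0.5
        else:
            sprinkler = do_sprinkler          # intervention: ignore the mechanism
        wet += rain or sprinkler
    return wet / n

p_obs = p_wet()                     # rung 1: associational P(wet)
p_do = p_wet(do_sprinkler=True)     # rung 2: P(wet | do(sprinkler = on)) = 1.0
```

Rung three (counterfactuals) would additionally require fixing the noise terms of an observed world before replaying it under the intervention.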

Citation:

@inproceedings{shrestha2025causalprophets,
    title = "From Causal Parrots to Causal Prophets? Towards Sound Causal Reasoning with Large Language Models",
    author = "Babu Shrestha, Rahul and Malberg, Simon and Groh, Georg",
    booktitle = "Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities ({NLP4DH})",
    year = "2025",
    month = may,
    address = "Albuquerque, USA",
    publisher = "Association for Computational Linguistics",
    pages = "319--333",
    doi = "10.18653/v1/2025.nlp4dh-1.29",
    url = "https://aclanthology.org/2025.nlp4dh-1.29/",
    isbn = "979-8-89176-234-3"
}

A Comprehensive Evaluation of Cognitive Biases in LLMs

Simon Malberg, Roman Poletukhin, Carolin M. Schuster, and Georg Groh

In: Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities (NLP4DH 2025)

Abstract:

“We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code and dataset to encourage future research on cognitive biases in LLMs.”
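
A test in such a framework is essentially a matched prompt pair plus a scoring rule. A hypothetical anchoring-bias example, purely illustrative; the released framework and dataset define their own templates and metrics:

```python
def anchoring_test(scenario: str, anchor: int):
    """Builds a matched pair of prompts: one neutral, one with an irrelevant
    numeric anchor.  A biased model's estimate should drift toward the anchor."""
    control = f"{scenario} Give your best numeric estimate."
    treatment = (f"A colleague mentioned the number {anchor} in passing. "
                 f"{scenario} Give your best numeric estimate.")
    return control, treatment

def bias_score(control_answer: float, treatment_answer: float, anchor: float):
    """Positive when the anchored answer moved toward the anchor."""
    return abs(control_answer - anchor) - abs(treatment_answer - anchor)
```

Instantiating such templates over many scenarios and anchors is what makes reliable large-scale test generation possible.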

Citation:

@inproceedings{malberg2025cognitivebiases,
    title = "A Comprehensive Evaluation of Cognitive Biases in {LLMs}",
    author = "Malberg, Simon and Poletukhin, Roman and Schuster, Carolin M. and Groh, Georg",
    booktitle = "Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities ({NLP4DH})",
    year = "2025",
    month = may,
    address = "Albuquerque, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.nlp4dh-1.50/",
    doi = "10.18653/v1/2025.nlp4dh-1.50",
    pages = "578--613",
    isbn = "979-8-89176-234-3"
}

FELIX: Automatic and Interpretable Feature Engineering Using LLMs

Simon Malberg, Edoardo Mosca, and Georg Groh

In: Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2024)

Abstract:

“Pre-processing and feature engineering are essential yet labor-intensive components of NLP. Engineers must often balance the demand for high model accuracy against interpretability, all while having to deal with unstructured data. We address this issue by introducing Feature Engineering with LLMs for Interpretability and Explainability (FELIX), a novel approach harnessing the vast world knowledge embedded in pre-trained Large Language Models (LLMs) to automatically generate a set of features describing the data. These features are human-interpretable, bring structure to text samples, and can be easily leveraged to train downstream classifiers. We test FELIX across five different text classification tasks, showing that it performs better than feature extraction baselines such as TF-IDF and LLM’s embeddings as well as s.o.t.a. LLM’s zero-shot performance and a fine-tuned text classifier. Further experiments also showcase FELIX’s strengths in terms of sample efficiency and generalization capabilities, making it a low-effort and reliable method for automatic and interpretable feature extraction.”
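
The core idea is that features become human-readable questions about the text. In FELIX an LLM both proposes and answers them; in this self-contained sketch the "LLM" is replaced by fixed keyword checks, so only the shape of the pipeline is real.

```python
# Hypothetical stand-in for the LLM step: these fixed keyword checks play the
# role of LLM-generated, human-interpretable feature questions.
FEATURES = {
    "mentions_price": lambda t: any(w in t for w in ("$", "price", "cheap")),
    "is_exclamatory": lambda t: "!" in t,
    "mentions_refund": lambda t: "refund" in t,
}

def featurize(text: str) -> dict:
    """Turn an unstructured text sample into an interpretable feature vector
    that a downstream classifier can be trained on."""
    return {name: int(f(text.lower())) for name, f in FEATURES.items()}

row = featurize("Great price, but I want a refund!")
```

Because every column has a plain-language meaning, a downstream classifier trained on such rows stays inspectable, unlike TF-IDF or embedding features.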

Citation:

@inproceedings{malberg2024felix,
    author = {Malberg, Simon and Mosca, Edoardo and Groh, Georg},
    title = {{FELIX}: Automatic and Interpretable Feature Engineering Using {LLMs}},
    booktitle = {Machine Learning and Knowledge Discovery in Databases. Research Track ({ECML PKDD})},
    year = {2024},
    month = aug,
    publisher = {Springer Nature Switzerland},
    address = {Cham},
    pages = {230--246},
    url = {https://link.springer.com/chapter/10.1007/978-3-031-70359-1_14},
    doi = {10.1007/978-3-031-70359-1_14},
    isbn = {978-3-031-70359-1}
}

A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers

Mohamed Hesham Ibrahim Abdalla, Simon Malberg, Daryna Dementieva, Edoardo Mosca, and Georg Groh

In: Information 2023, 14(10), 522

Abstract:

“As generative NLP can now produce content nearly indistinguishable from human writing, it is becoming difficult to identify genuine research contributions in academic writing and scientific publications. Moreover, information in machine-generated text can be factually wrong or even entirely fabricated. In this work, we introduce a novel benchmark dataset containing human-written and machine-generated scientific papers from SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica, as well as papers co-created by humans and ChatGPT. We also experiment with several types of classifiers—linguistic-based and transformer-based—for detecting the authorship of scientific text. A strong focus is put on generalization capabilities and explainability to highlight the strengths and weaknesses of these detectors. Our work makes an important step towards creating more robust methods for distinguishing between human-written and machine-generated scientific papers, ultimately ensuring the integrity of scientific literature.”
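
Linguistic-based detectors of the kind compared in the paper score texts on hand-crafted signals before feeding them to a classifier. A minimal sketch of such features (illustrative only; the paper's detectors use much richer feature sets and trained models):

```python
def linguistic_features(text: str) -> dict:
    """A few hand-crafted signals of the kind linguistic detectors rely on."""
    tokens = text.split()
    types = {t.lower() for t in tokens}
    n = max(len(tokens), 1)            # guard against empty input
    return {
        "type_token_ratio": len(types) / n,   # low values suggest repetition
        "avg_word_length": sum(map(len, tokens)) / n,
        "n_tokens": len(tokens),
    }
```

Transformer-based detectors replace such vectors with learned representations, trading the explainability studied in the paper for accuracy.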

Citation:

@article{abdalla2023idmgsp,
    title = {A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers},
    author = {Abdalla, Mohamed Hesham Ibrahim and Malberg, Simon and Dementieva, Daryna and Mosca, Edoardo and Groh, Georg},
    journal = {Information},
    volume = {14},
    number = {10},
    pages = {522},
    year = {2023},
    month = sep,
    publisher = {MDPI},
    url = {https://www.mdpi.com/2078-2489/14/10/522},
    issn = {2078-2489},
    doi = {10.3390/info14100522}
}

Pre-Ph.D. Papers

During my bachelor's and master's studies, I co-authored three research papers: one in strategic management, one on reinforcement learning, and one in operations research and warehouse logistics.

Tesla Moving Forward

Laurenz Bredenfeld, Manuel Cherubim, Anna Christina Kellermann, Christina Lehmann, Simon Malberg, Julie Rafn, Yona Kwon and Seungho Choi

In: Journal of New Industry and Business 2020, 38(1), pp. 47–70

Abstract:

“Tesla, a leading U.S. electric vehicle maker, led the electric vehicle market in 2019, recording about 30 trillion won in annual sales. Although the already saturated automotive industry, Tesla has the upper hand in the electric vehicle sector. This study analyzed Tesla’s competitive advantage factors and innovative firm strategies through analysis of external environment surrounding the electric vehicle industry and internal environment of Tesla. According to the analysis, Tesla’s competitive advantage is not only in developing innovative products, but also in its leadership of chief executive Elon Musk. Tesla, however, faced some issues in 2019 as it focused on producing Model 3 that consumers can purchase at reasonable prices. Through this case study, the pursuit of both a cost-leadership strategy to create affordable products, and a differentiated strategy that targets customers who could purchase high-priced products, could be presented as a threat to the business. At the end of the study, proposed a brief suggestion to how Tesla could deal with issues. By showing what kind of threat it could pose when a firm makes a strategic decision, this study intended to contribute to demonstrating the importance of strategic decision-making by firms.”

Citation:

@article{bredenfeld2020tesla,
  title={Tesla Moving Forward},
  author={Bredenfeld, Laurenz and Cherubim, Manuel and Kellermann, Anna Christina and Lehmann, Christina and Malberg, Simon and Rafn, Julie and Kwon, Yona and Choi, Seungho},
  journal={신산업경영저널},
  volume={38},
  number={1},
  pages={47--70},
  year={2020}
}

Multi-Stage Reinforcement Learning for Object Detection

Jonas König, Simon Malberg, Martin Martens, Sebastian Niehaus, Artus Krohn-Grimberghe, and Arunselvan Ramaswamy

In: Advances in Computer Vision. CVC 2019. Advances in Intelligent Systems and Computing, vol 943

Abstract:

“We present a reinforcement learning approach for detecting objects within an image. Our approach performs a step-wise deformation of a bounding box with the goal of tightly framing the object. It uses a hierarchical tree-like representation of predefined region candidates, which the agent can zoom in on. This reduces the number of region candidates that must be evaluated so that the agent can afford to compute new feature maps before each step to enhance detection quality. We compare an approach that is based purely on zoom actions with one that is extended by a second refinement stage to fine-tune the bounding box after each zoom step. We also improve the fitting ability by allowing for different aspect ratios of the bounding box. Finally, we propose different reward functions to lead to a better guidance of the agent while following its search trajectories. Experiments indicate that each of these extensions leads to more correct detections. The best performing approach comprises a zoom stage and a refinement stage, uses aspect-ratio modifying actions and is trained using a combination of three different reward metrics.”
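
The zoom actions form a small discrete action space over bounding boxes, and intersection-over-union is the usual reward signal for how tightly a box frames the object. A sketch of both (my own simplification; the paper's agent also has refinement and aspect-ratio actions and learns from computed feature maps):

```python
def zoom(box, action):
    """One zoom step: shrink the bounding box toward one of four quadrants,
    or stop.  box = (x0, y0, x1, y1) in pixel coordinates."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    return {
        "top_left":     (x0, y0, mx, my),
        "top_right":    (mx, y0, x1, my),
        "bottom_left":  (x0, my, mx, y1),
        "bottom_right": (mx, my, x1, y1),
        "stop":         box,
    }[action]

def iou(a, b):
    """Intersection over union between two boxes, the basis of the reward."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)
```

The hierarchical tree of region candidates in the paper is exactly the set of boxes reachable by repeated `zoom` calls from the full image.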

Citation:

@inproceedings{konig2019multi,
  title={Multi-stage reinforcement learning for object detection},
  author={K{\"o}nig, Jonas and Malberg, Simon and Martens, Martin and Niehaus, Sebastian and Krohn-Grimberghe, Artus and Ramaswamy, Arunselvan},
  booktitle={Science and Information Conference},
  pages={178--191},
  year={2019},
  organization={Springer}
}

Simulating Storage Policies for an Automated Grid-Based Warehouse System

Michaela Beckschäfer, Simon Malberg, Kevin Tierney, and Christoph Weskamp

In: International Conference on Computational Logistics (ICCL 2017)

Abstract:

“Robotic fulfillment systems are becoming commonplace at warehouses across the world. High-density, grid-based storage systems in particular, such as the AutoStore system, are being used in a variety of contexts, but very little literature exists to guide decision makers in picking the right policies for operating such a system. Storage policies can have a large effect on the efficiency and storage capacity of robotic fulfillment systems. We therefore introduce a discrete event simulation for grid-based storage and examine input storage policies under a couple of storage scenarios. Our simulation provides decision makers with an easy way of testing policies before implementing them in a real system, and shows that selecting the correct policy can lead to up to a 7% input performance improvement, and 60% better box utilization.”
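
A discrete event simulation of this kind reduces to an event queue ordered by time. A single-station toy version (my own minimal sketch; the paper's simulator models robots, grid cells, and the storage policies under comparison):

```python
import heapq

def simulate_station(arrival_times, service_time=2.0):
    """Tiny discrete event simulation of one input station: jobs arrive at
    the given times and are served one at a time in arrival order.  Returns
    the total time jobs spend waiting."""
    events = list(arrival_times)
    heapq.heapify(events)                 # event queue ordered by time
    free_at = 0.0                         # when the station finishes its job
    total_wait = 0.0
    while events:
        t = heapq.heappop(events)
        start = max(free_at, t)           # wait if the station is still busy
        total_wait += start - t
        free_at = start + service_time
    return total_wait
```

Comparing storage policies then amounts to running such a simulation with different rules for where incoming boxes go and measuring throughput and utilization.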

Citation:

@inproceedings{beckschafer2017simulating,
  title={Simulating storage policies for an automated grid-based warehouse system},
  author={Becksch{\"a}fer, Michaela and Malberg, Simon and Tierney, Kevin and Weskamp, Christoph},
  booktitle={International Conference on Computational Logistics},
  pages={468--482},
  year={2017},
  organization={Springer}
}

Books

I led a large-scale research project during my A levels. In December 2012, we surveyed more than 900 students at the Gymnasium Harsewinkel about their (social) media usage and preferences. We later published the results in a 2015 book (available in German only).

mobil – aktiv – entrückt? Schüler erforschen Nutzungsmuster sozialer Medien im Schulalltag

Matthias Eggersmann, Simon Malberg, Simon Specht, and Lars Zumbansen

In: kopaed 2015

Blurb:

“Young people and their media preferences are popular ‘objects’ of empirical media-effects research. The participants, commonly labeled ‘digital natives’, are interviewed and observed often enough in this context, yet they are usually not actively involved in the research and analysis process. The contributions in this publication adopt a change of perspective in this respect, presenting students’ own research engagement with the field of social media. In quantitative and qualitative studies, independently designed, conducted, and analyzed as part of a project course in the upper secondary school, the young researchers focus above all on the significance of Facebook and WhatsApp as perceived by their fellow students. They examine the influence of mobile smartphones on usage behavior, the degrees of activity in engaging with social media, and the young users’ understanding of reality. The volume is aimed at media educators and teachers interested both in the specific research designs and results of the student studies and in the underlying didactic and methodological conception of ‘enrichment’ projects for media research in the classroom.”

Citation:

@book{eggersmann2015mobil,
  title={mobil--aktiv--entr{\"u}ckt? Sch{\"u}ler erforschen Nutzungsmuster sozialer Medien im Schulalltag},
  author={Eggersmann, Matthias and Malberg, Simon and Specht, Simon and Zumbansen, Lars},
  year={2015},
  publisher={kopaed}
}