New study: Can a transparent heuristic outperform AI in tracing founder origins?

Prof. Dr. David Bendig (left), Dr. Jonathan Hoke (center), Leoni Onken (right)
How accurately can a person's country of origin be inferred from large-scale social networking data, and which computational approach performs best? This study, published in Technological Forecasting & Social Change (VHB JQ3: B; IF: 12.0), addresses this question by systematically comparing three approaches (a rule-based heuristic, four supervised machine-learning models, and a large language model, GPT-4o in zero-shot mode) across 80 countries of origin. Evaluated on 500 expert-validated LinkedIn profiles, the transparent, theoretically grounded rule-based heuristic achieves higher accuracy (87.2 %) than the large language model (82.2 %) and matches the best machine-learning models in the sample.
The key findings:
- A transparent heuristic reaches higher accuracy than GPT-4o: Across 500 expert-validated LinkedIn profiles spanning 80 countries of origin, the rule-based heuristic achieves higher overall accuracy (87.2 %) than GPT-4o in zero-shot mode (82.2 %) and matches the best machine-learning models in the sample. The heuristic also offers full transparency and reproducibility, properties that machine-learning models and large language models cannot provide in the same way.
- The large language model often relies on current residence rather than stable origin cues: In 26 of 45 cases where the heuristic was correct and GPT-4o was not, the model selected the explicitly mentioned country of residence instead of weighting stable biographical signals such as native language and education. The heuristic prioritizes signals acquired in early life (language, education) over later-life mobility indicators (work history, residence), an ordering that appears to align with the underlying information structure.
- Combining multiple profile signals is associated with substantially higher accuracy: Predictions drawing on five features (name, residence, language, education, work history) reach 96.7 % accuracy, while predictions based on only two features (name and residence) reach 66.7 %. This finding argues against name-only inference, which dominates much of the prior literature.
- Migrant founders shape Germany's IT startup landscape more than official figures suggest: Applied to 5,479 founders of IT-related startups in Germany between 2014 and 2024, the heuristic indicates that 30.3 % are of non-German origin, compared with 21.1 % in the broader German startup population. This points to substantially higher migrant participation in digital entrepreneurship than in the startup population overall.
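To illustrate the idea of weighting early-life signals above later-life mobility indicators, the following is a minimal Python sketch of a rule-based origin heuristic. It is not the study's implementation: the weights, the position of the name signal, the tie-breaking rule, and the function name `infer_origin` are assumptions chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, Optional

# Illustrative weights only (assumptions, not taken from the study):
# early-life signals (language, education) outweigh later-life
# mobility signals (work history, residence).
WEIGHTS = {
    "language": 3.0,
    "education": 3.0,
    "name": 2.0,
    "work_history": 1.0,
    "residence": 1.0,
}


def infer_origin(profile: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the country with the highest weighted vote.

    `profile` maps a feature name to the country that feature points to,
    or to None when the feature is missing. Returns None if no feature
    is available. Ties go to the country encountered first in WEIGHTS order.
    """
    scores: Dict[str, float] = defaultdict(float)
    for feature, weight in WEIGHTS.items():
        country = profile.get(feature)
        if country:
            scores[country] += weight
    if not scores:
        return None
    return max(scores, key=scores.get)


# A hypothetical profile: early-life cues point to India,
# while current work and residence point to Germany.
example = {
    "name": "India",
    "language": "India",
    "education": "India",
    "work_history": "Germany",
    "residence": "Germany",
}
print(infer_origin(example))
```

In this hypothetical profile, a rule that only looks at the stated residence would return Germany, whereas the weighted vote returns India because the early-life signals dominate. Every rule remains inspectable, which is the transparency property the findings emphasize.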
For practice, the findings suggest that transparent, theoretically grounded methods can match or outperform large language models on well-structured inference tasks. Researchers and practitioners working with digital trace data may benefit from integrating multiple profile attributes rather than relying on name-based inference alone. The reliability of inferred origin labels depends strongly on feature availability, so robustness checks and confidence thresholds should accompany any applied use.
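One simple way to act on feature availability is to accept an inferred label only when enough profile features are present and to route the remaining profiles to manual review. The sketch below assumes a hypothetical cut-off of three features; the study does not prescribe this value, and any threshold would need to be calibrated against validated data.

```python
from typing import Dict, Optional

FEATURES = ("name", "residence", "language", "education", "work_history")

# Hypothetical cut-off (an assumption, not a value from the study).
MIN_FEATURES = 3


def count_available(profile: Dict[str, Optional[str]]) -> int:
    """Count how many of the five profile features are present."""
    return sum(1 for feature in FEATURES if profile.get(feature))


def accept_label(profile: Dict[str, Optional[str]],
                 min_features: int = MIN_FEATURES) -> bool:
    """Accept an inferred origin label only if enough features are available."""
    return count_available(profile) >= min_features


sparse = {"name": "Poland", "residence": "Germany"}
rich = {
    "name": "Poland",
    "residence": "Germany",
    "language": "Poland",
    "education": "Poland",
    "work_history": "Germany",
}
print(accept_label(sparse), accept_label(rich))
```

With this cut-off, a profile offering only name and residence is flagged for review, while a profile with all five features is accepted automatically.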
Policymakers, accelerators, incubators, and digital firms that depend on digital trace data but lack direct demographic information can benefit from the validated methodology. The study "Tracing origins: Comparing a heuristic, machine learning, and a large language model for migrant identification" by Prof. Dr. David Bendig, Dr. Jonathan Hoke, and Leoni Onken (all University of Münster) is available here (open access): https://www.sciencedirect.com/science/article/pii/S0040162526001435
Contact for inquiries:
Dr. Jonathan Hoke
University of Münster
Institute for Entrepreneurship
Leonardo-Campus 9, 48149 Münster
Email: jhoke@uni-muenster.de