Evaluation of an Algorithmic Hiring System: A Public-Sector Case Study

Jun 30

Eticas.ai conducted an independent fairness and bias evaluation of an algorithmic recruitment system operated by a public employment agency in Europe. Using a post-deployment, end-to-end methodology applied to five years of operational data, the evaluation identified systematic disparities in shortlisting outcomes by gender, age, education level and national origin. Key findings include adverse impact for women in mid-salary roles (Disparate Impact Ratio, DIR = 0.786, below the recognized 0.80 threshold) and the complete absence from the pipeline of candidates aged 55 and over, a group that makes up 15.6% of the local labor force. The evaluation produced targeted recommendations on governance, keyword protocols and GDPR compliance. It also illustrates why independent fairness evaluation of AI recruitment systems matters: without it, bias in hiring can go undetected.

“Most auditing frameworks assume systems are coherent, observable and technically bounded. In practice, real-world AI deployments are fragmented socio-technical assemblages. This case demonstrates that advancing trustworthy AI requires guaranteed access to vendor data and processes. Without it, auditing remains partial and reactive.”

— Gemma Galdon-Clavell, Eticas.ai CEO

Context: Why This Evaluation Matters

The client is a public employment agency operating a labor intermediation service for a major European city. The agency relies on a third-party AI platform to support candidate search and shortlisting: a seven-stage process that combines automated filtering with human analyst judgment.

AI-assisted hiring systems are classified as high-risk under the EU AI Act. Bias embedded in training data, employer-defined criteria or system design can systematically exclude protected groups by sex, age or origin, compounding inequalities already present in the labor market. For a public agency with an explicit mandate to promote equal access to employment, ensuring that its processes do not systematically disadvantage particular groups is both a legal obligation and an institutional priority. The case is also a useful reference point for any organization weighing public-sector AI evaluation requirements.

What We Evaluated

The evaluation covered the recruitment pipeline stages over which the agency has direct operational responsibility — from candidate registration through to the point the shortlist is sent to hiring companies. Decisions employers make about who to interview or hire fall outside this scope.

The system comprises seven stages:

1. Receipt of a vacancy from a hiring company

2. Assignment of the vacancy to an agency analyst

3. Analyst keyword selection and search criteria definition

4. Automated filtering and ranking by the third-party platform

5. Return of filtered candidate profiles to the analyst

6. Manual analyst review and shortlist selection

7. Shortlist sent to the hiring company

The third-party platform's internal matching and ranking logic is proprietary and was not disclosed to the agency or to Eticas.ai during the evaluation. That opacity is itself a material finding: vendors of "black box" recruitment tools can withhold the logic that determines who gets shortlisted, leaving deploying organizations unable to verify — or defend — what their own systems are doing. Independent evaluation exists precisely to compensate for that gap, reconstructing fairness evidence from outcomes when the underlying logic stays hidden.

Methodology

Eticas.ai applied its post-deployment algorithmic impact evaluation methodology, examining the system across three analytical layers:

- Pre-processing: who enters the pipeline, what data is collected and whether the candidate population reflects the demographic makeup of the labor market, benchmarked against external labor force survey data.

- In-processing: how the system processes candidates once they enter, including bias-risk mapping across every stage, Disparate Impact analysis, intersectional analysis and statistical significance testing.

- Post-processing: temporal consistency analysis to assess whether shortlisting patterns are stable or shifting over time, and whether disparities persist, narrow or widen across the evaluated period.
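The in-processing layer mentions statistical significance testing without naming the test used. As a minimal sketch, assuming a two-proportion z-test (one common choice, not necessarily the one Eticas.ai applied), a check on two groups' shortlisting rates could look like this. The function name and all counts are hypothetical.

```python
import math

def two_proportion_z_test(selected_a, total_a, selected_b, total_b):
    """Return (z, two-sided p-value) under H0: both groups share one selection rate."""
    rate_a = selected_a / total_a
    rate_b = selected_b / total_b
    # Pooled rate under the null hypothesis of no difference between groups.
    pooled = (selected_a + selected_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only; not taken from the evaluation.
z, p = two_proportion_z_test(selected_a=150, total_a=2000,
                             selected_b=190, total_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A test like this complements the DIR: a ratio can fall below the threshold on very small samples by chance, and a ratio above it can still reflect a statistically reliable difference.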

Primary fairness metric: Disparate Impact Ratio (DIR) = selection rate of protected group ÷ selection rate of the highest-performing reference group. A DIR below 0.80 signals potential adverse impact, following the EEOC four-fifths rule as an established practitioner benchmark. The evaluation was carried out with reference to the risk-management obligations for high-risk AI systems under the EU AI Act (Regulation (EU) 2024/1689) and assessed data-handling practices against GDPR (Regulation (EU) 2016/679) requirements.
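The DIR definition above can be sketched in a few lines. This is an illustrative helper, not the evaluation's actual tooling: the function names are assumptions, and the example rates are the mid-salary shortlisting figures reported under Gender in the findings.

```python
FOUR_FIFTHS_THRESHOLD = 0.80  # EEOC four-fifths rule, used as a practitioner benchmark

def disparate_impact_ratio(group_rate, reference_rate):
    """Selection rate of the protected group divided by that of the reference group."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return group_rate / reference_rate

def flags_adverse_impact(dir_value, threshold=FOUR_FIFTHS_THRESHOLD):
    """True when the ratio falls below the threshold."""
    return dir_value < threshold

# Mid-salary vacancies: women shortlisted at 7.43%, men at 9.45%.
mid_salary_dir = disparate_impact_ratio(0.0743, 0.0945)
print(round(mid_salary_dir, 3), flags_adverse_impact(mid_salary_dir))
```

With those two rates the ratio rounds to 0.786, which matches the reported figure and falls below the 0.80 benchmark.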

Key Findings

Gender

At the aggregate level, overall hiring outcomes look equitable across binary genders — the share of women in final outcomes drops by less than two percentage points from the candidate pool, a difference that isn't statistically significant. But the aggregate picture hides meaningful stage-specific disparities.

In mid-salary vacancies (€15,000–€24,000), women are shortlisted at 7.43% compared with 9.45% for men (DIR = 0.786, below the 0.80 threshold). For high-salary vacancies (above €24,000), the DIR is 0.829 — above the threshold but statistically significant and worth monitoring.

Women are matched to vacancies with lower minimum salaries than men across 15 of 20 sectors. The largest gaps appear in Internet and Technology (€2,012), Engineering and Automotive (€1,994), Art and Design (€1,727), and Real Estate and Architecture (€1,675). The gap persists within sectors, not just across them, meaning sector composition alone doesn't explain the result.

Shortlisting compounds sector-level gender imbalances at specific pipeline stages. In Real Estate and Architecture, women's representation falls from 27.0% of the matched pool to 14.6% of the shortlist — a 12.4-point reduction at that stage alone. Since candidates don't self-select into roles (they're shortlisted by human analysts), this disparity can't be attributed to application behavior.
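The stage-level comparison above is a percentage-point difference in a group's share between two consecutive pipeline stages. A minimal sketch, using the Real Estate and Architecture shares from the text and a hypothetical helper name:

```python
def representation_change(share_before, share_after):
    """Percentage-point change in a group's share between two pipeline stages."""
    return round(share_after - share_before, 1)

# Women's share (%) in Real Estate and Architecture, as reported above.
stage_shares = {"matched pool": 27.0, "shortlist": 14.6}

change = representation_change(stage_shares["matched pool"],
                               stage_shares["shortlist"])
print(f"Change at shortlisting: {change} points")
```

Tracking this difference per stage, rather than only comparing the first and last stage, shows where in the pipeline a group's representation is lost.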

Non-Binary Candidates

Non-binary candidates are shortlisted in only 3.51% of pipeline entries (DIR = 0.295 — less than a third of the rate for men). This result is statistically significant even accounting for small sample sizes. The agency currently has no mechanism to detect, monitor, or investigate this pattern.

Age

Adults aged 55 and over represent approximately 15.6% of the city's active labor force but are absent from every stage of the pipeline. The evaluation can't attribute this to a single cause — it may reflect registration barriers, vacancy composition, or data gaps — but the absence is complete and requires investigation.

Applicants aged 16–25 show the highest discard rate (approximately 13%) and the lowest shortlisting rate of any age group. The largest gender shortlisting disparity by age occurs in the 46–55 group, where men are shortlisted at 13.3% compared with 10.4% for women (DIR = 0.77).

Education Level

Candidates with higher education qualifications — bachelor's degrees, university degrees, and graduate credentials — are shortlisted at rates 40–42% lower than secondary school graduates (DIRs 0.576–0.600). Higher-educated candidates skew female (58–62% women at degree and graduate level, versus 37% at secondary certificate level). The system's apparent penalization of overqualification falls disproportionately on female candidates.

National Origin

Candidates from outside the host country are shortlisted at a significantly lower rate than nationals, with DIR values below 0.80 for several origin groups. Country of origin functions as a proxy for nationality, ethnicity, and migrant background — all protected attributes under EU and national anti-discrimination law. The evaluation could not isolate whether this effect is driven by vacancy requirements, analyst search behavior, or platform filtering.

A Structural Challenge Beyond This Organization

The conditions found in this evaluation are not exceptional. Third-party platforms with proprietary logic, incomplete demographic data, and undocumented human discretion at multiple decision points characterize the majority of deployed AI hiring systems. Under these conditions:

- Fairness cannot be cleanly measured when key filtering logic is proprietary and undisclosed.

- Responsibility cannot be attributed to a single component when human and automated decisions are interleaved and undocumented.

- Aggregate temporal stability can mask underlying shifts: shortlisting gaps narrowed over the five-year period, but overall shortlisting rates also fell for all groups, so a smaller gap does not by itself indicate a fairer process.
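The last point can be illustrated with a small sketch. The rates below are hypothetical and chosen only to show the effect: when every group's shortlisting rate falls, the absolute gap can shrink while the ratio between groups does not move at all.

```python
def gap_and_ratio(group_rate, reference_rate):
    """Return (absolute gap in percentage points, ratio of the two rates)."""
    gap_points = (reference_rate - group_rate) * 100
    ratio = group_rate / reference_rate
    return round(gap_points, 2), round(ratio, 3)

# Hypothetical shortlisting rates; not taken from the evaluation.
first_year = gap_and_ratio(group_rate=0.09, reference_rate=0.12)
last_year = gap_and_ratio(group_rate=0.045, reference_rate=0.06)

print("first year:", first_year)
print("last year: ", last_year)
```

In this toy case the gap halves from 3.0 to 1.5 points, yet the ratio stays at 0.75 in both years, below the 0.80 benchmark. Reporting the gap and the ratio side by side avoids reading a falling gap as an improvement.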

This is not a failure of the evaluation. It's a structural feature of how AI systems are currently deployed. Organizations using third-party AI hiring tools should not assume that what they cannot see is not producing harm.

Recommendations

Based on the evaluation findings, Eticas.ai provided the following recommendations:

- Keyword protocol: Implement a standardized, auditable protocol for analyst searches, including approved terms, a requirement to include all grammatical gender forms, a list of prohibited gender- or age-associated terms, and mandatory search logging per vacancy.

- Missing data: Conduct a root-cause analysis of missing demographic data: 24.0% of candidates have no gender recorded and 14.5% have no origin recorded, the two attributes with the most severe outcome gaps.

- Age 55+ absence: Investigate the complete absence of candidates aged 55+, including whether registration design, profile requirements, or system filters create structural barriers for this group.

- Filtering criteria: Review filtering criteria for age and gender disparities. Where employer requirements appear to encode indirect bias, analysts should be empowered and trained to challenge and renegotiate them.

- Profile completion: Add a visible profile completion indicator. Candidates with 0–3 completed fields are shortlisted at 1.3–2.6%, versus 10.4–13.2% for those with 4–5 fields, a fivefold gap not disclosed to candidates.

- Monitoring framework: Establish a continuous fairness monitoring framework: quarterly DIR computation by sex, age, and origin; salary band tracking by sex within each sector; non-binary progression monitoring; and annual representativeness benchmarking.

- Analyst training: Conduct analyst training on non-discrimination obligations, covering how to identify discriminatory requirements, apply gender-neutral search practices and meet applicable legal obligations.

- Vendor access: Condition future platform engagements on guaranteed access to vendor documentation and decision logic. Proprietary matching and ranking logic is a structural transparency barrier that limits any fairness evaluation.

- GDPR review: Conduct a dedicated GDPR and data governance review covering data processing agreements, candidate consent at registration, and the use of sensitive attributes in semi-automated decision-making.
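The quarterly DIR computation named in the monitoring recommendation could be sketched as follows. This is an assumption-laden illustration, not the agency's or Eticas.ai's implementation: the record layout, function names and quarter labels are hypothetical, and the reference group is fixed by the caller for simplicity rather than chosen as the highest-performing group each quarter.

```python
from collections import defaultdict

def quarterly_dir(records, reference_group, threshold=0.80):
    """records: iterable of (quarter, group, shortlisted) tuples.

    Returns {(quarter, group): (DIR rounded to 3 places, below_threshold)}.
    """
    counts = defaultdict(lambda: [0, 0])  # (quarter, group) -> [shortlisted, total]
    for quarter, group, shortlisted in records:
        counts[(quarter, group)][0] += int(shortlisted)
        counts[(quarter, group)][1] += 1

    report = {}
    for (quarter, group), (selected, total) in counts.items():
        if group == reference_group:
            continue
        ref_selected, ref_total = counts.get((quarter, reference_group), (0, 0))
        if ref_selected == 0:
            continue  # no usable reference rate for this quarter
        ratio = (selected / total) / (ref_selected / ref_total)
        report[(quarter, group)] = (round(ratio, 3), ratio < threshold)
    return report

def make_records(quarter, group, selected, total):
    """Build toy records with a given number of shortlisted candidates."""
    return [(quarter, group, i < selected) for i in range(total)]

# Hypothetical data for illustration only.
records = (
    make_records("2024-Q1", "men", 10, 100) + make_records("2024-Q1", "women", 7, 100)
    + make_records("2024-Q2", "men", 10, 100) + make_records("2024-Q2", "women", 9, 100)
)
report = quarterly_dir(records, reference_group="men")
for key, value in sorted(report.items()):
    print(key, value)
```

The same grouping key could be extended with sector or salary band to cover the other items in the monitoring recommendation. Candidates with no recorded attribute would need their own category so that missing data stays visible instead of being dropped.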

What This Means for Organizations Deploying AI Hiring Systems

This evaluation should be read not only as an assessment of one system, but as a structural case study with implications for any organization deploying AI-assisted recruitment without systematic fairness monitoring.

Aggregate-level fairness metrics can mask important disparities concentrated at specific pipeline stages, salary bands, contract types or demographic intersections. Equitable overall hiring outcomes do not mean the system is fair at the points where filtering actually happens.
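A short sketch shows how pooling can hide a band-level disparity. The counts are hypothetical and chosen only to show the mechanism, and men are fixed as the reference group for simplicity.

```python
def dir_from_counts(group, reference):
    """DIR from (shortlisted, total) pairs for a group and its reference group."""
    group_rate = group[0] / group[1]
    reference_rate = reference[0] / reference[1]
    return group_rate / reference_rate

# Hypothetical (shortlisted, total) counts per salary band; not from the evaluation.
bands = {
    "low": {"women": (120, 1000), "men": (110, 1000)},
    "mid": {"women": (74, 1000), "men": (95, 1000)},
}

def pooled(group_name):
    """Sum shortlisted and total counts for one group across all bands."""
    selected = sum(counts[group_name][0] for counts in bands.values())
    total = sum(counts[group_name][1] for counts in bands.values())
    return selected, total

per_band = {name: round(dir_from_counts(counts["women"], counts["men"]), 3)
            for name, counts in bands.items()}
overall = round(dir_from_counts(pooled("women"), pooled("men")), 3)

print("overall:", overall)
print("per band:", per_band)
```

Here the pooled ratio sits comfortably above 0.80 while the mid band falls below it, which is why stratified reporting matters alongside the aggregate figure.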

Vendor opacity is itself a governance failure. An organization deploying a third-party AI tool that cannot explain its ranking and filtering logic cannot know whether that tool is producing discriminatory outcomes, and cannot defend itself if it is.

Independent AI fairness evaluation is not a compliance checkbox. It's the mechanism by which an organization takes active, verifiable control of what its AI systems are actually doing in production.

Noelia Amoedo