Skip to content

ORCA: don't push LOJ ON-pred onto its own outer in PushThruOuterChild#1836

Open
yjhjstz wants to merge 2 commits into
apache:mainfrom
yjhjstz:fix/orca-loj-bool-on-pred-pushdown
Open

ORCA: don't push LOJ ON-pred onto its own outer in PushThruOuterChild#1836
yjhjstz wants to merge 2 commits into
apache:mainfrom
yjhjstz:fix/orca-loj-bool-on-pred-pushdown

Conversation

@yjhjstz

@yjhjstz yjhjstz commented Jun 30, 2026

Copy link
Copy Markdown
Member

Summary

Fix a wrong-result bug in ORCA where a LEFT JOIN's own ON predicate ends up duplicated as a scan filter on the join's outer relation, discarding outer rows
that LOJ semantics require to be null-padded.

Repro

CREATE TABLE x(c1 boolean);
CREATE TABLE y1(c1 boolean);
CREATE TABLE y2(c1 boolean);
INSERT INTO x  VALUES (true),(false),(false);
INSERT INTO y1 VALUES (true);
INSERT INTO y2 VALUES (true);

SET optimizer = on;
SELECT x.c1, y2.c1 AS y2c1
  FROM x LEFT JOIN y1 ON x.c1
         LEFT JOIN y2 ON x.c1
 WHERE y2.c1 IS NULL;
Optimizer Result Seq Scan on x
Postgres planner 2 rows (correct) no scan filter
ORCA (before fix) 0 rows Filter: c1

Trigger: two or more chained LEFT JOINs whose ON-clauses use the same boolean column from the outer relation, with a WHERE on top.

Root cause

CNormalizer::PushThruOuterChild is invoked from PushThruSelect with a predicate that happens to be (or contain) the LOJ's own ON predicate.
SplitConjunct / FPushable accept it as pushable to the outer relation, because FPushable only checks that the predicate's columns are a subset of the
outer's output columns — an LOJ ON-pred that references only outer-side columns trivially satisfies that.

Consequence: ORCA wraps the outer with Select(outer, on_pred), producing LOJ(Select(x, c1), inner, x.c1). The LOJ's ON-pred is preserved as Join Filter, but a redundant Filter: c1 is also planted on the outer scan, which discards outer rows that don't satisfy the ON-pred — exactly the rows LOJ
must null-pad and keep.

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


yjhjstz added 2 commits July 2, 2026 13:58
When two non-inner joins in an NAry join have structurally identical ON
predicates, ORCA could silently drop or misplace predicates, producing
wrong results:

  select x.c1, y2.c1 from x left join y1 on x.c1
                            left join y2 on x.c1
   where y2.c1 is null;

returned 0 rows instead of the two null-padded FALSE rows, because a
copy of the ON pred ended up as a scan filter on x.

Root cause: CJoinOrderDPv2's m_expression_to_edge_map is keyed on
structural equality (CExpression::HashValue / CUtils::Equals). With two
structurally identical ON preds, RecursivelyMarkEdgesAsUsed can only
ever mark one of the duplicate edges as used, so
AddSelectNodeForRemainingEdges treated the other edge as a leftover
WHERE predicate and emitted it into a Select on top of the join tree.
The normalizer then legitimately pushed that Select onto the LOJ's own
outer child, filtering out rows that outer-join semantics require to be
null-padded. (The map is only populated when a WHERE predicate
references an NIJ right child, which is why the WHERE clause is needed
to trigger the bug.)

Fix at the source: skip ON-pred edges (m_loj_num > 0) when collecting
remaining edges. An NIJ's ON predicate is always applied by the join
itself when its right child is placed (IsRightChildOfNIJ), so an
"unused" ON-pred edge can only be a bookkeeping artifact of the
structural-equality map and must never be duplicated above the join.

An earlier attempt fixed this downstream, by stripping conjuncts that
structurally match the LOJ's ON pred in CNormalizer::PushThruOuterChild.
That layer cannot distinguish the leaked ON-pred copy from legitimate,
structurally identical conjuncts arriving from above, and silently
deleted user predicates:

  select * from x left join y on x.c1 where x.c1;       -- 3 rows, not 1
  select 1 from a t1
    left join (a t2 left join a t3 on t2.id = 1)
    on t2.id = 1;                                        -- lost the
                                                         -- Index Cond on
                                                         -- t2 and the ON
                                                         -- pred entirely

With this fix, the original repro returns the correct 2 rows with no
scan filter on x, the queries above return planner-identical results,
and the nested-LOJ query regains Index Cond: (id = 1) on t2.

Add the repro as a regression test in bfv_joins.
…suite

Mirror the bfv_joins regression case for the duplicate-ON-pred DPv2 bug
(two LEFT JOINs sharing the same boolean ON column plus a WHERE on the
inner side) into the pax_storage copy of the suite.

Unlike the earlier version of this mirror, join_optimizer.out is left
untouched: with the root-cause fix in CJoinOrderDPv2 the previously
refreshed plan (Seq Scan on t2, outer Join Filter reduced to true) no
longer exists; the original expected plan with Index Cond: (id = 1) on
t2 is produced again.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants