Sunday, August 21, 2016

Adaptive Cursor Sharing Fail

Here is another example (besides the fact that Adaptive Cursor Sharing only gets evaluated during a PARSE call (still valid in 12c) and supports a maximum of 14 bind variables) I've recently come across at a client site where the default implementation of Adaptive Cursor Sharing fails to create a more suitable execution plan for different bind variable values.

Broken down to a bare minimum the query was sometimes executed using non-existing values for a particular bind variable, but other times these values were existing and very popular. There were two suitable candidate indexes and one of them appeared to the optimizer more attractive in case of the "non-existing" value case. Unfortunately this index was a very bad choice for the "existing and very popular" case.

The critical point of the case is that due to other, additional filters on the same table/index the final number of rows produced by the row source operation was (almost) the same for the "non-existing" and the "existing and very popular" case, but in the latter case a large proportion of the bad index had to be scanned due to the popularity of the bind value, which was the only predicate that could be used as ACCESS predicate and the additional ones could only be applied as FILTER on the index.

So although there was a suitable histogram available on the column in question and the optimizer in principle was aware of the different popularities of values and marked the cursor as "bind sensitive" it looks like when a cursor is then monitored it just is assigned to different buckets as exposed via V$SQL_CS_HISTOGRAM, based on the rows produced by the row source. Since the operation in question didn't differ much in the final number of rows produced between the two cases (either 0 rows or 1 row) the different executions were all assigned to the same bucket - although the amount of work required to produce this single row was huge, reflected in execution time and physical / logical I/Os performed per execution.

Obviously this vastly different amount of work performed to produce a similar number of rows is not used as criteria for marking a cursor as "bind aware" and evaluate different execution plans based on ranges of bind values.

There is a view V$SQL_CS_STATISTICS that includes information about CPU_TIME and BUFFER_GETS but in 11.2.0.4 it only reflects the initial execution when starting with the "non-existing" value case, but not the slow executions afterwards. In 12.1.0.2 the view is no longer populated at all, which suggests that this information is simply not used for deciding the "bind aware" status of a cursor.

Discussing the case with Mohamed Houri, who has spent significant time on investigating the "Adaptive Cursor Sharing" feature seemed to confirm this assumption.

Here is a simple test case that allows reproducing the issue:
-- FILTER1 is highly skewed (here one very popular value -1)
create table t
as
select
        rownum as id
      , rownum as n
      , case when rownum <= 100000 then rownum else -1 end as filter1
      , rownum as filter2
      , rpad('x', 200) as filler1
      , rownum as filter3
      , rpad('x', 200) as filler2
from
        dual
connect by level <= 1e6;

-- Histogram on FILTER1 should make the sample query "bind sensitive"
exec dbms_stats.gather_table_stats(null, 't', method_opt => 'for all columns size 1 for columns size 254 filter1')

-- The "good" index, ACCESS on all three columns
-- But higher BLEVEL (3) due to FILLER1 only one value and wide - needs to be part of every branch block
create index t_idx1 on t (filter1, filler1, filter2) compress 2 /*advanced low*/;

-- The "bad" index, ACCESS only on first column, FILTER on subsequent columns
-- But lower BLEVEL (2) due to FILTER3 before FILLER1, leads to much more compact branch blocks
create index t_idx2 on t (filter1, filter3, filler1, filler2, filter2) compress 1 /*advanced low*/;
The critical point is that the good index has a higher BLEVEL than the bad index. So in case a non-existing value for FILTER1 gets used the optimizer will favor the index with the lower BLEVEL, which is for that case the key cost component:
set echo on

-- Non-existing value
-- T_IDX2 gets preferred
explain plan for
select sum(n) from t where filter1 = -999999 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |        |     1 |   216 |     4   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |        |     1 |   216 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T      |     1 |   216 |     4   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_IDX2 |     1 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FILTER1"=(-999999) AND "FILLER1"='x

                                                                                         ' AND
              "FILTER2"=999999)
       filter("FILTER2"=999999 AND "FILLER1"='x

                                                                                      ')
                                                                                     
-- Non-existing value
-- T_IDX1 has a higher cost
explain plan for
select /*+ index(t t_idx1) */ sum(n) from t where filter1 = -999999 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |        |     1 |   216 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |        |     1 |   216 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T      |     1 |   216 |     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_IDX1 |     1 |       |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FILTER1"=(-999999) AND "FILLER1"='x

                                                                                         ' AND
              "FILTER2"=999999)
              
-- Existing and very popular value
-- T_IDX1 gets preferred
explain plan for
select sum(n) from t where filter1 = -1 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |        |     1 |   216 |     5   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE                      |        |     1 |   216 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T      |     1 |   216 |     5   (0)| 00:00:01 |
|*  3 |    INDEX RANGE SCAN                  | T_IDX1 |     1 |       |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FILTER1"=(-1) AND "FILLER1"='x

                                                                                    ' AND
              "FILTER2"=999999)


-- Existing and very popular value
-- T_IDX2 is a bad idea, correctly reflected in the cost estimate
explain plan for
select /*+ index(t t_idx2) */ sum(n) from t where filter1 = -1 and filler1 = rpad('x', 200) and filter2 = 999999;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------------------------
| Id  | Operation                            | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |        |     1 |   216 | 58269   (1)| 00:00:03 |
|   1 |  SORT AGGREGATE                      |        |     1 |   216 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T      |     1 |   216 | 58269   (1)| 00:00:03 |
|*  3 |    INDEX RANGE SCAN                  | T_IDX2 |     1 |       | 58268   (1)| 00:00:03 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("FILTER1"=(-1) AND "FILLER1"='x

                                                                                    ' AND
              "FILTER2"=999999)
       filter("FILTER2"=999999 AND "FILLER1"='x

                                                                                      ')
So when optimizing for the two different cases the optimizer does the right thing and understands that for the case "existing and very popular" T_IDX2 is a bad choice. It's also obvious from the "Predicate Information" section that the index T_IDX2 only can use FILTER1 as ACCESS predicate.

But when using bind variables the different cases are not recognized and the bad index is used for both cases when the optimization is based on the "non-existing value" case:
-- Default behaviour, although being BIND_SENSITIVE thanks to the histogram, no ACS kicks in
-- The V$SQL_CS_HISTOGRAM shows all executions being in bucket 0, so according to BIND_SENSITIVE monitoring no need for action
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
  res number;
begin
  select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
  res number;
begin
  for i in 1..1000 loop
    select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  end loop;
end;
/

SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.03
SQL>
SQL> variable n number
SQL>
SQL> exec :n := -999999

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
  2    res number;
  3  begin
  4    select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.28
SQL>
SQL> exec :n := -1

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> declare
  2    res number;
  3  begin
  4    for i in 1..1000 loop
  5      select sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  6    end loop;
  7  end;
  8  /

PL/SQL procedure successfully completed.

Elapsed: 00:04:48.64
SQL>

-- With BIND_AWARE a second child cursors gets generated and used on second parse, interesting
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
  res number;
begin
  select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
  res number;
begin
  for i in 1..1000 loop
    select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  end loop;
end;
/

SQL> alter session set statistics_level = all;

Session altered.

Elapsed: 00:00:00.03
SQL>
SQL> variable n number
SQL>
SQL> exec :n := -999999

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
  2    res number;
  3  begin
  4    select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  5  end;
  6  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.04
SQL>
SQL> exec :n := -1

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.03
SQL>
SQL> declare
  2    res number;
  3  begin
  4    for i in 1..1000 loop
  5      select /*+ bind_aware */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  6    end loop;
  7  end;
  8  /

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.20
SQL>

-- Or hinting for a fixed plan helps, too
set echo on timing on time on

alter session set statistics_level = all;

variable n number

exec :n := -999999

declare
  res number;
begin
  select /*+ index(t t_idx1) */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
end;
/

exec :n := -1

declare
  res number;
begin
  for i in 1..1000 loop
    select /*+ index(t t_idx1) */ sum(n) into res from t where filter1 = :n and filler1 = rpad('x', 200) and filter2 = 999999;
  end loop;
end;
/
Executing the unhinted, default case leads to a very inefficient execution of the thousand executions using the popular value (and the actual table / index at the client site was much larger than this example and the query got executed very frequently) - compare that to the execution time for the other two variants, either using the BIND_AWARE hint or simply requesting the good index.

So when requesting to mark the cursor immediately BIND_AWARE via the corresponding hint (only possible if the cursor qualifies to be bind sensitive) Oracle happily generates a second child cursor at the second PARSE call with a different, more efficient execution plan for the popular value, because when a cursor is marked bind aware a completely different algorithm gets used that is based on range of values as exposed via V$SQL_CS_SELECTIVITY, and since the different values are falling into different ranges of values a new plan gets generated that is different from the previous one and hence gets used from then on for those (range of) values.

Monday, August 15, 2016

Nested Loop Join Physical I/O Optimizations

Having done my mini-series on Nested Loop join logical I/O optimizations a while ago I unfortunately never managed to publish anything regarding the Nested Loop join physical I/O optimizations, which are certainly much more relevant to real-life performance.

Therefore the main purpose of this blog post is to point you to Nikolay Savvinov's (whose blog I can recommend in general) great mini-series covering various aspects of these optimizations:

Part 1
Part 2
Part 3
Summary

One point that - at least to me - isn't entirely clear when reading Nikolay's series is which specific plan shape he refers to, in particular since in 12c even more plan shapes for a Nested Loop join are possible.

Hence here is an attempt to summarize the various two table Nested Loop join plan shapes as of 12c and what kind of physical I/O optimizations one can expect from them:

1. Nested Loop Join Batching, the default in most cases since 11g and also in 12c
-----------------------------------------------------------
| Id  | Operation                     | Name              |
-----------------------------------------------------------
|   0 | SELECT STATEMENT              |                   |
|   1 |  SORT AGGREGATE               |                   |
|   2 |   NESTED LOOPS                |                   |
|   3 |    NESTED LOOPS               |                   |
|   4 |     INDEX FULL SCAN           | SYS_IOT_TOP_97632 |
|*  5 |     INDEX RANGE SCAN          | T_I6_IDX          |
|   6 |    TABLE ACCESS BY INDEX ROWID| T_I6              |
-----------------------------------------------------------
This is the plan shape that Nikolay calls "batch" or "batching" in his posts. It can provide batched I/O (mainly "db file parallel read" issuing multiple I/O requests in a single call, potentially asynchronous I/O) on both the INDEX RANGE SCAN as well as the TABLE ACCESS BY ROWID operation as described in Nikolay's blog post.

Please note that I deliberately write "can provide batched I/O", because, as Nikolay also points out in his posts, the runtime engine monitors the execution and can dynamically adjust the behaviour, which also means that it might decide to use conventional single block reads ("db file sequential read").

Please also note that in addition to the "db file parallel read" operation that submits multiple I/O requests in a single I/O submit call (see Frits Hoogland's blog post from some time ago about Linux internals of this I/O operation) the runtime engine might use "db file scattered read" multi-block I/Os under certain circumstances, in particular when performing "cache warmup prefetching".

Also note that as Nikolay points out when enabling SQL Trace or "rowsource statistics" the "Batched I/O" optimization for some reason gets disabled.

Also this plan shape at least in 12c seems to lead to inconsistencies in the Real-Time SQL Monitoring in several ways. First the number of "Executions" for the INDEX RANGE SCAN and TABLE ACCESS component of the inner row source is not necessarily consistent with the number of rows in the driving row source, second almost all activity and I/O volume seems to be contributed to the "INDEX RANGE SCAN" plan operation and not the "TABLE ACCESS" operation, even if it's the "TABLE ACCESS" that causes physical I/O.

2. Nested Loop Join Prefetch plan shape including BATCHED ROWID table access (only from 12c on)
------------------------------------------------------------------
| Id  | Operation                            | Name              |
------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                   |
|   1 |  SORT AGGREGATE                      |                   |
|   2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T_I5              |
|   3 |    NESTED LOOPS                      |                   |
|   4 |     INDEX FULL SCAN                  | SYS_IOT_TOP_97716 |
|*  5 |     INDEX RANGE SCAN                 | T_I5_IDX          |
------------------------------------------------------------------
This is the Nested Loop prefetching plan shape introduced in Oracle 9i combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c. It's the plan shape Nikolay refers to with "BATCHED ROWID" in his posts. In my (and Nikolay's) tests this provides the most aggressive batching of I/O (highest number of I/O requests submitted per call in "db file parallel read") for the TABLE ACCESS BY ROWID BATCHED, but didn't perform the same on the index ("db file sequential read", single block reads on the index segment), which doesn't mean that different test and data setups might provide that, too.
Note you should get this plan shape only with explicit hinting or disabling Nested Loop Join Batching via parameter.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
no_nlj_batching(b)
--batch_table_access_by_rowid(b)
The "batch_table_access_by_rowid" isn't strictly necessary since it's the default behaviour in 12c.

Other variants are possible, like "use_nl_with_index" instead of "use_nl" and "index" separately or "opt_param('_nlj_batching_enabled', 0)" instead of "no_nlj_batching" to disable the batching (but then batching is disabled for the whole execution plan, not just for a particular join).

3. Classic Nested Loop Join Prefetch introduced in 9i
----------------------------------------------------------
| Id  | Operation                    | Name              |
----------------------------------------------------------
|   0 | SELECT STATEMENT             |                   |
|   1 |  SORT AGGREGATE              |                   |
|   2 |   TABLE ACCESS BY INDEX ROWID| T_I5              |
|   3 |    NESTED LOOPS              |                   |
|   4 |     INDEX FULL SCAN          | SYS_IOT_TOP_97716 |
|*  5 |     INDEX RANGE SCAN         | T_I5_IDX          |
----------------------------------------------------------
This is what you get in pre-12c when preventing Nested Loop Join Batching, in 12c the BATCHED table access needs to be disabled in addition.

This plan shape provides the less aggressive table prefetching as described in Nikolay's posts (he refers to this plan shape as "Prefetch"), maximum number of requests submitted per call in a "db file parallel read" operation seems to 39.

As mentioned above, it didn't provide index prefetching in my tests.

The session statistics don't mention "Batched I/O", so although the "db file parallel read" wait event is the same the internal implementation obviously is a different code path.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
no_nlj_batching(b)
no_batch_table_access_by_rowid(b)
Other variants as above.

4. Classic Nested Loop Join plan shape with BATCHED ROWID table access (only from 12c on)
-------------------------------------------------------------------
| Id  | Operation                             | Name              |
-------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                   |
|   1 |  SORT AGGREGATE                       |                   |
|   2 |   NESTED LOOPS                        |                   |
|   3 |    INDEX FULL SCAN                    | SYS_IOT_TOP_97716 |
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T_I5              |
|*  5 |     INDEX RANGE SCAN                  | T_I5_IDX          |
-------------------------------------------------------------------
This is the classic Nested Loop join plan shape (pre-9i) combined with the new TABLE ACCESS BY INDEX ROWID BATCHED introduced in 12c.

In my tests it provides only "intra-loop" table prefetching, so if the INDEX RANGE SCAN of a single loop iteration points to different table blocks the TABLE ACCESS BY INDEX ROWID BATCHED can make use of "db file parallel read" to batch these I/O requests to the table blocks, and the session statistics show "Batched I/O" counters increasing.

However, it doesn't attempt to optimize "inter-loop" / multiple loop iterations, so if each loop iteration via the INDEX RANGE SCAN only points to a single table block, only single block reads ("db file sequential read") on the TABLE ACCESS BY INDEX ROWID BATCHED can be seen.

In my tests this plan shape didn't provide index prefetching / I/O optimizations and performed single block reads ("db file sequential read") on the INDEX RANGE SCAN operation.

For a two table join it needs explicit hinting in 12c to arrive at this plan shape, but it is the default plan shape for the "inner" joins in case of nested Nested Loop joins (multiple, consecutive Nested Loop joins in a row), see below for more information.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
opt_param('_nlj_batching_enabled', 0)
no_nlj_prefetch(b)
--batch_table_access_by_rowid(b)
See my older posts regarding Nested Loop join logical I/O optimizations why the combination of OPT_PARAM nd NO_NLJ_PREFETCH is required to arrive at this plan shape - in short, specifying the obvious NO_NLJ_PREFETCH plus NO_NLJ_BATCHING doesn't work. Other variants as above.

5. Classic Nested Loop Join plan shape
-----------------------------------------------------------
| Id  | Operation                     | Name              |
-----------------------------------------------------------
|   0 | SELECT STATEMENT              |                   |
|   1 |  SORT AGGREGATE               |                   |
|   2 |   NESTED LOOPS                |                   |
|   3 |    INDEX FULL SCAN            | SYS_IOT_TOP_97716 |
|   4 |    TABLE ACCESS BY INDEX ROWID| T_I5              |
|*  5 |     INDEX RANGE SCAN          | T_I5_IDX          |
-----------------------------------------------------------
This is the classic Nested Loop join plan shape (pre-9i).

Interestingly in my tests at least in 12c it provides also "intra-loop" table prefetching, so if the INDEX RANGE SCAN of a single loop iteration points to different table blocks the TABLE ACCESS BY INDEX ROWID can make use of "db file parallel read" to submit multiple of these I/O requests to the table blocks in a single call, but the session statistics don't show "Batched I/O" counters increasing and the maximum number of requests seem to be 39, so it is less aggressive and looks very similar to the internal implementation used for the classic "table prefetching" plan shape above.

There is no sign of "inter-loop" optimizations and the INDEX RANGE SCAN seems to make use of single block reads only ("db file sequential read").

For pre-12c this is default plan shape used for the "inner" joins in case of nested Nested Loop joins (multiple Nested Loop joins in a row), see below for more information.

In 12c, given a two table join between alias A and B, the following hints would be required to arrive at this plan shape:
leading(a b)
use_nl(b)
index(b)
opt_param('_nlj_batching_enabled', 0)
no_nlj_prefetch(b)
no_batch_table_access_by_rowid(b)
Other variants as above.

Multiple, consecutive Nested Loop Joins


Another point that I think it is important to mention when describing these "inter-loop" prefetching and batching Nested Loop join optimization techniques is that they only apply to the "outer-most" Nested Loop join in case of multiple, consecutive Nested Loop joins, which makes their "game changing" character that Nikolay mentions in his posts less strong to me.

For example, this is the 12c default shape of a four table join using three Nested Loop joins:
----------------------------------------------------------------------
| Id  | Operation                                | Name              |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                   |
|   1 |  SORT AGGREGATE                          |                   |
|   2 |   NESTED LOOPS                           |                   |
|   3 |    NESTED LOOPS                          |                   |
|   4 |     NESTED LOOPS                         |                   |
|   5 |      NESTED LOOPS                        |                   |
|   6 |       INDEX FULL SCAN                    | SYS_IOT_TOP_97632 |
|   7 |       TABLE ACCESS BY INDEX ROWID BATCHED| T_I1              |
|*  8 |        INDEX RANGE SCAN                  | T_I1_IDX          |
|   9 |      TABLE ACCESS BY INDEX ROWID BATCHED | T_I2              |
|* 10 |       INDEX RANGE SCAN                   | T_I2_IDX          |
|* 11 |     INDEX RANGE SCAN                     | T_I3_IDX          |
|  12 |    TABLE ACCESS BY INDEX ROWID           | T_I3              |
----------------------------------------------------------------------
As it can be seen only the outer-most Nested Loop joins gets the "batching" plan shape and can benefit from the optimizations described above. The inner Nested Loop joins show the classic plan shape, in case of 12c with the ROWID BATCHED option, so they can only benefit from the "intra-loop" optimizations described above, if a single loop iteration INDEX RANGE SCAN points to several table blocks.

Note that even when using explicit, additional NLJ_BATCHING hints for the inner tables joined I wasn't able to enforce any other plan shape.

If there are multiple blocks of (consecutive) Nested Loop joins (for example a HASH JOIN in between), then in each block the outer-most Nested Loop join gets the optimization, so multiple of those are possible per execution plan:
---------------------------------------------------------------------
| Id  | Operation                               | Name              |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT                        |                   |
|   1 |  SORT AGGREGATE                         |                   |
|   2 |   NESTED LOOPS                          |                   |
|   3 |    NESTED LOOPS                         |                   |
|*  4 |     HASH JOIN                           |                   |
|   5 |      TABLE ACCESS BY INDEX ROWID BATCHED| T_I2              |
|   6 |       INDEX FULL SCAN                   | T_I2_IDX          |
|   7 |      NESTED LOOPS                       |                   |
|   8 |       NESTED LOOPS                      |                   |
|   9 |        INDEX FULL SCAN                  | SYS_IOT_TOP_97632 |
|* 10 |        INDEX RANGE SCAN                 | T_I1_IDX          |
|  11 |       TABLE ACCESS BY INDEX ROWID       | T_I1              |
|* 12 |     INDEX RANGE SCAN                    | T_I3_IDX          |
|  13 |    TABLE ACCESS BY INDEX ROWID          | T_I3              |
---------------------------------------------------------------------
As it can be seen now that T_I2 is joined via a HASH JOIN both Nested Loop joins to T_I1 and T_I3 get the "batched" optimization plan shape.

I don't know why this limitation is there that only the outer-most in a block of Nested Loop joins gets the optimized plan shape, but it certainly can reduce the possible benefit.

Since I don't think there is a costing difference on CBO level for the different Nested Loop join plan shapes (so the decision which shape to use isn't cost-driven) it might be possible to see a significant performance difference depending on which join gets the optimization - there might be room for improvement by manually influencing the join order in case of multiple, consecutive Nested Loop joins.