WCAG 2.2: What Changed, Why It Matters, and How to Implement It

Nine new success criteria. One removed. Here is what every frontend engineer needs to know.

WCAG 2.2 became an official W3C Recommendation on October 5, 2023. If your team is still targeting 2.1 as a compliance baseline, you are already behind. The W3C explicitly advises using 2.2 to maximize future applicability of accessibility efforts, and regulators in the EU, UK, and US are actively aligning their policies to the latest version.

This article covers every new success criterion using a consistent format: what the spec requires, why the criterion exists and who it protects, and how to implement it in practice.

What Was Removed First: 4.1.1 Parsing

Before the new criteria, one was cut. WCAG 2.2 removed 4.1.1 Parsing, which previously required well-formed HTML so assistive technologies could reliably parse it.

Why removed: Modern browsers and screen readers have become resilient enough to handle malformed markup without accessibility failures. The criterion no longer reliably predicted real-world accessibility outcomes, so the working group dropped it.

Practical note: If your organization is contractually obligated to WCAG 2.0 or 2.1 conformance, you may still need to test and report on 4.1.1 separately. For new 2.2 audits, it is gone.

The 9 New Success Criteria

1. Focus Not Obscured (Minimum) — 2.4.11 — Level AA

What: When a UI component receives keyboard focus, the focused element must not be entirely hidden by author-created content. Partially obscured is acceptable at this level. Entirely hidden is not.

Why: Users who navigate by keyboard (people with motor disabilities, switch device users, power users) need to see where focus is at all times. Sticky headers, floating cookie banners, fixed chat widgets, and bottom navigation bars are the most common offenders. When focus moves behind one of these layers and disappears completely, the user loses their place on the page with no visual cue for what is currently selected. This is especially disorienting for users with cognitive disabilities who are more sensitive to context loss during a task.

How to implement:

The core fix is ensuring scroll-padding-top accounts for any fixed header height so the browser scrolls enough to keep focused elements visible.

/* If your sticky header is 64px tall */
html {
  scroll-padding-top: 80px; /* header height + breathing room */
}

/* Alternatively, scoped to focusable elements */
a:focus,
button:focus,
[tabindex]:focus {
  scroll-margin-top: 80px;
}

For dynamic header heights (collapsing navs, announcement banners that appear after load), update the value from JavaScript:

function updateScrollPadding() {
  const header = document.querySelector('.sticky-header');
  const height = header?.getBoundingClientRect().height ?? 0;
  document.documentElement.style.scrollPaddingTop = `${height + 16}px`;
}

window.addEventListener('resize', updateScrollPadding);
updateScrollPadding();

Test it: Tab through your page with a sticky header visible. Every focused element should remain at least partially visible, never fully hidden behind the header or footer.

2. Focus Not Obscured (Enhanced) — 2.4.12 — Level AAA

What: Same intent as 2.4.11, but stricter. The focused component must not be obscured at all, not even partially.

Why: At AA (2.4.11), a focused element that is 10% visible technically passes. For users with low vision who rely on high zoom levels or screen magnification, even partial obscuring can make the focus indicator undetectable in practice. The AAA version closes that gap entirely.

How to implement:

Everything from 2.4.11 applies. The additional requirement is that no part of the focused element is covered by overlapping author-created content. In practice this means:

  • scroll-padding values must fully clear the focused element above any sticky layers.
  • Fixed overlays (modals, drawers, sheets) must trap focus inside themselves while open, so keyboard focus can never land on content behind them, as in the sketch below.

// Trap focus inside an open modal
function trapFocus(modalElement) {
  const focusable = modalElement.querySelectorAll(
    'a, button, input, textarea, select, [tabindex]:not([tabindex="-1"])'
  );
  const first = focusable[0];
  const last = focusable[focusable.length - 1];

  modalElement.addEventListener('keydown', (e) => {
    if (e.key !== 'Tab') return;
    if (e.shiftKey) {
      if (document.activeElement === first) {
        e.preventDefault();
        last.focus();
      }
    } else {
      if (document.activeElement === last) {
        e.preventDefault();
        first.focus();
      }
    }
  });
}

3. Focus Appearance — 2.4.13 — Level AAA

What: When a keyboard focus indicator is visible, it must meet specific size and contrast requirements. The focus indicator must be at least as large as the area of a 2 CSS pixel thick perimeter of the unfocused component, and the contrast between the focused and unfocused states of those pixels must be at least 3:1.

Why: Browser default focus outlines are frequently invisible against common backgrounds, and many codebases globally suppress them with outline: none (still a widespread anti-pattern). Users with low vision, cognitive disabilities, and anyone relying entirely on keyboard navigation depend on a focus indicator that is visually obvious, not just technically present. A faint, thin blue ring at low contrast does not serve these users.

How to implement:

The first step is removing the global outline: none pattern. If you need to suppress the browser ring for mouse users, use :focus-visible instead of :focus:

/* Wrong: removes focus ring for everyone, including keyboard users */
*:focus {
  outline: none;
}

/* Right: removes ring only when pointer (not keyboard) is in use */
*:focus:not(:focus-visible) {
  outline: none;
}

/* Custom focus indicator that satisfies AAA geometric and contrast requirements */
*:focus-visible {
  outline: 3px solid #0f62fe;
  outline-offset: 2px;
  border-radius: 2px;
}

A practical shortcut: a 3px solid outline in a color with at least 3:1 contrast against the surrounding background satisfies the geometric requirement for most standard interactive components. For components on dark surfaces, check the contrast of your focus color against the dark background, not just the default page background.
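
For dark surfaces specifically, a two-tone indicator works on any background. A minimal sketch (the class name is illustrative): the inner white ring guarantees a contrast edge against dark surfaces, while the blue outline handles light ones.

/* Two-tone focus ring: white inner ring + blue outer ring */
.on-dark :focus-visible {
  outline: 3px solid #0f62fe;
  outline-offset: 2px;           /* leaves a 2px gap outside the border */
  box-shadow: 0 0 0 2px #ffffff; /* fills that gap with a white ring */
}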

4. Dragging Movements — 2.5.7 — Level AA

What: Any functionality that uses a dragging movement (click-and-drag, touch drag) must also be achievable with a single pointer action (click or tap) without dragging. Exceptions apply only when the drag is essential to the functionality itself.

Why: Dragging requires simultaneously pressing, holding, and moving a pointer. This compound gesture is unreliable or impossible for users with hand tremors, limited fine motor control, or motor disabilities affecting pointer precision. Sortable lists, kanban boards, sliders, map pan gestures, and date range pickers are common failure cases. The criterion does not prohibit drag interactions. It requires that a non-drag path exists to accomplish the same result.

How to implement:

For sortable lists, provide explicit move buttons alongside the drag handle:

function SortableItem({ item, onMoveUp, onMoveDown }) {
  return (
    <div draggable onDragStart={...} onDragEnd={...}>
      <span>{item.label}</span>
      <button aria-label={`Move ${item.label} up`} onClick={onMoveUp}>↑</button>
      <button aria-label={`Move ${item.label} down`} onClick={onMoveDown}>↓</button>
    </div>
  );
}

For range sliders, use native <input type="range"> wherever possible. It supports arrow key adjustment out of the box. Custom slider implementations frequently break keyboard support:

<input
  type="range"
  min={0}
  max={100}
  value={value}
  onChange={(e) => setValue(Number(e.target.value))}
  aria-label="Price range maximum"
/>

For map or canvas drag interactions, provide explicit pan controls: arrow-key panning and clickable pan buttons in the UI.
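
A minimal sketch of keyboard panning, assuming a Leaflet-style map object exposing panBy([x, y]); the container element and 50px step are illustrative:

const PAN_STEP = 50;

mapContainer.setAttribute('tabindex', '0'); // make the map focusable

mapContainer.addEventListener('keydown', (e) => {
  const offsets = {
    ArrowUp: [0, -PAN_STEP],
    ArrowDown: [0, PAN_STEP],
    ArrowLeft: [-PAN_STEP, 0],
    ArrowRight: [PAN_STEP, 0],
  };
  if (offsets[e.key]) {
    e.preventDefault(); // stop the page from scrolling instead
    map.panBy(offsets[e.key]);
  }
});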

5. Target Size (Minimum) — 2.5.8 — Level AA

What: The size of the pointer target for interactive elements must be at least 24×24 CSS pixels. Exceptions apply when: the target’s offset from adjacent targets is at least 24px, the target is inline within text content, the browser controls the target size (default form controls), or a small size is essential to the information conveyed.

Why: Small tap targets fail users with tremors, limited dexterity, or motor disabilities who use alternative pointer devices with reduced precision. Tightly packed icon buttons, small checkboxes, link-dense navigation menus, and close buttons in notification toasts are the most common failure patterns. Note that 24×24 px is the AA minimum. The AAA version (2.5.5, carried forward from 2.1) requires 44×44 px. Most mobile UX guidelines already recommend 44px. WCAG 2.2 establishes the legal floor.

How to implement:

Set a baseline minimum for all interactive elements:

button,
a,
[role="button"],
input[type="checkbox"],
input[type="radio"] {
  min-width: 24px;
  min-height: 24px;
}

/* Prefer the AAA-level 44x44 on touch interfaces */
@media (pointer: coarse) {
  button,
  a,
  [role="button"] {
    min-width: 44px;
    min-height: 44px;
  }
}

For icon-only buttons where the visual size is constrained by design, expand the hit area using padding while keeping the visible footprint the same:

.icon-button {
  padding: 10px; /* expands hit area to 44x44 if icon is 24x24 */
  display: inline-flex;
  align-items: center;
  justify-content: center;
}

The spacing exception is a legitimate tool for constrained layouts. If two 16×16 icons are spaced so their center-to-center distance is 24px or more, they satisfy the minimum even without being 24×24 in physical size. Use it as a fallback, not a design default.
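
A minimal sketch of the spacing exception in CSS (selectors are illustrative). With 16px icons and an 8px gap, the centers sit exactly 24px apart:

.toolbar {
  display: flex;
  gap: 8px; /* 16px icon + 8px gap = 24px center-to-center */
}

.toolbar .small-icon {
  width: 16px;
  height: 16px;
}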

6. Consistent Help — 3.2.6 — Level A

What: If a web page provides a help mechanism (human contact, self-help documentation, automated contact, or a contact form), that mechanism must appear in the same relative location across all pages within the site.

Why: Users with cognitive disabilities often need help completing tasks and struggle when support resources appear in different places on different pages. If the help icon is in the top-right corner on the homepage but shifts to the footer on the checkout page, the inconsistency creates friction precisely when the user is most likely to need assistance. This criterion does not require that you have a help mechanism. It only requires that if you do, its location is stable.

How to implement:

This is primarily a layout and design system decision. Anchor help mechanisms inside a shared layout component so they cannot drift between pages:

// Layout.jsx
export function Layout({ children }) {
  return (
    <>
      <GlobalHeader />     {/* help trigger lives here, always */}
      <main>{children}</main>
      <GlobalFooter />
    </>
  );
}

Avoid conditionally hiding the help trigger on specific page types. If suppression is unavoidable (full-screen checkout flows, immersive experiences), make sure the mechanism reappears in the same location once normal layout resumes.

“Same relative location” means the same area of the page (top-right, bottom-right, etc.), not exact pixel coordinates. Responsive layouts that shift the help button between breakpoints are acceptable as long as it is consistently placed within each breakpoint’s layout pattern.

7. Redundant Entry — 3.3.7 — Level A

What: Information that a user has already provided in a multi-step process must either be auto-populated in subsequent steps or be selectable from previously entered values. Users must not be required to re-enter the same information within the same session unless re-entry is essential (e.g., password confirmation for security) or the information is no longer valid.

Why: Re-entering data is a significant cognitive and motor burden. For users with cognitive disabilities, being asked to retype a name or address they entered three steps ago interrupts task flow, increases error likelihood, and often causes abandonment. For users with motor disabilities, every additional keystroke carries a real physical cost. This criterion formalizes what good UX already recommends: do not ask for something you already have.

How to implement:

In a multi-step React form, store session state at a high level and pre-populate later steps:

// FormContext.jsx
const FormContext = React.createContext({});

export function FormProvider({ children }) {
  const [formData, setFormData] = React.useState({});

  const updateFormData = (values) => {
    setFormData((prev) => ({ ...prev, ...values }));
  };

  return (
    <FormContext.Provider value={{ formData, updateFormData }}>
      {children}
    </FormContext.Provider>
  );
}

// Step 3: Shipping -- pre-populate from billing when same
function ShippingStep() {
  const { formData, updateFormData } = React.useContext(FormContext);
  const [sameAsBilling, setSameAsBilling] = React.useState(false);

  const address = sameAsBilling
    ? formData.billingAddress
    : formData.shippingAddress;

  return (
    <>
      <label>
        <input
          type="checkbox"
          checked={sameAsBilling}
          onChange={(e) => setSameAsBilling(e.target.checked)}
        />
        Same as billing address
      </label>
      <AddressFields defaultValues={address} onChange={...} />
    </>
  );
}

The “same as billing” pattern already present in most e-commerce checkouts is a textbook 3.3.7 implementation. Apply the same logic to any multi-step flow where information asked in step N could have been collected in step N-1 or earlier.

8. Accessible Authentication (Minimum) — 3.3.8 — Level AA

What: A cognitive function test (memorizing a password, solving a puzzle, transcribing characters) must not be required at any step of an authentication process unless: an alternative authentication method is available that does not require a cognitive function test, a mechanism is available to help complete the test (such as copy-paste support or a password manager), or the test involves recognizing objects or personal content the user themselves provided.

Why: Password recall is itself a cognitive function test. Many users with cognitive disabilities, memory impairments, or learning disabilities cannot reliably memorize and recall complex passwords on demand. CAPTCHAs add another cognitive or visual puzzle on top of that. This criterion protects access to the authentication layer itself, which is a prerequisite for using everything else on the platform.

How to implement:

The highest-impact single change: allow paste into password fields and respect autocomplete attributes. Blocking paste breaks password managers and forces manual re-entry.

// Wrong: blocks paste, breaks password managers
<input
  type="password"
  onPaste={(e) => e.preventDefault()}
/>

// Right: paste allowed, autocomplete declared
<input
  type="password"
  autoComplete="current-password"
/>

Use the correct autocomplete values for the browser and password managers to fill credentials automatically:

<input type="email" autocomplete="username" />
<input type="password" autocomplete="current-password" />
<input type="password" autocomplete="new-password" /> <!-- registration -->

Additional paths to compliance:

  • Offer magic link login (no password to recall).
  • Support passkeys as an alternative.
  • If you use a CAPTCHA, provide an audio alternative and a non-CAPTCHA path for users who cannot complete visual challenges.

The object recognition exception covers CAPTCHAs that ask users to identify images they uploaded themselves (security images, personal photos). These are permitted because the cognitive anchor is personal memory, not abstract recall.

9. Accessible Authentication (Enhanced) — 3.3.9 — Level AAA

What: Same as 3.3.8, but stricter: the object recognition and personal content exceptions are removed. A cognitive function test is only permitted if an alternative authentication method or an assistive mechanism (such as password manager support) is available.

Why: Even object recognition and personal image selection require a level of memory and visual processing that users with severe cognitive or visual disabilities may not be able to reliably perform. The AAA version is an absolute requirement: authentication cannot depend on cognitive tests at all.

How to implement:

In practice, the robust conformant paths at AAA are authentication methods that require no cognitive recall at all:

Passkeys (WebAuthn/FIDO2): device-level biometric or PIN-based auth. No password to memorize or recall.

// Passkey authentication
const assertion = await navigator.credentials.get({
  publicKey: {
    challenge: serverGeneratedChallenge,
    allowCredentials: [{ type: 'public-key', id: existingCredentialId }],
    userVerification: 'preferred',
  },
});
// Send assertion to server for verification

Magic links: one-time login URLs delivered to a verified email or phone. The user clicks a link in their inbox. No password involved.
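
A minimal issuance sketch in Node — sendEmail and tokenStore are hypothetical stand-ins for your mailer and persistence layer:

import crypto from 'node:crypto';

async function sendMagicLink(email) {
  // Unguessable, single-use token with a short expiry
  const token = crypto.randomBytes(32).toString('base64url');
  await tokenStore.save(token, {
    email,
    expiresAt: Date.now() + 15 * 60 * 1000, // 15 minutes
  });
  await sendEmail(email, `https://example.com/login?token=${token}`);
}

// On GET /login?token=..., look up the token, check expiry,
// delete it (single use), then create the session.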

SSO delegation: the authentication burden is delegated to a trusted identity provider. The provider’s own authentication is outside your conformance boundary.

AAA is not required for most products, but passkeys are rapidly becoming the industry default regardless of compliance requirements. Implementing them satisfies the accessibility requirement and the general trend toward passwordless authentication simultaneously.

Audit Checklist for WCAG 2.2 Compliance

If you are auditing an existing product, prioritize in this order:

Level A (minimum baseline)

  • [ ] Help mechanisms appear in a consistent location across all pages (3.2.6)
  • [ ] Multi-step forms do not re-ask for information already collected in the session (3.3.7)

Level AA (legal and enterprise standard)

  • [ ] No focused element is entirely hidden by sticky headers, footers, or overlays (2.4.11)
  • [ ] All interactive targets are at least 24×24 CSS pixels or have adequate spacing (2.5.8)
  • [ ] Every drag interaction has a single-pointer alternative (2.5.7)
  • [ ] Password fields allow paste and declare correct autocomplete attributes (3.3.8)
  • [ ] No authentication step requires a cognitive function test without an accessible alternative (3.3.8)

Level AAA (aspirational or contractual)

  • [ ] No focused element is partially obscured by author-created overlays (2.4.12)
  • [ ] Focus indicators meet minimum size and 3:1 contrast requirements (2.4.13)
  • [ ] Authentication requires no cognitive function tests of any kind (3.3.9)

The Bigger Picture

WCAG 2.2’s additions are tightly scoped around three user groups: people with cognitive or learning disabilities, users with low vision, and users on mobile and touch devices. Every new criterion maps to a failure mode that real products ship regularly: password fields that block paste, drag interactions with no keyboard fallback, sticky headers that swallow focused elements, icon buttons too small to tap precisely.

None of these fixes are expensive once you know what to look for. The authentication changes are often one autocomplete attribute away. The target size and focus visibility issues are a few lines of CSS. The redundant entry problem is a state management question you have probably already partially solved elsewhere in your codebase.

The investment is low. The user impact is not.

Questions about implementing any of these criteria? Drop them in the comments.

Mastering the Orchestration Pattern in React: Taming Complex Component Logic

TL;DR: The Orchestration Pattern is a powerful way to manage complex interactions between components, API calls, and state updates in React. Instead of letting logic scatter across dozens of useEffect hooks and event handlers, you centralize it into a dedicated “orchestrator” component or hook. This approach makes your code more predictable, testable, and maintainable—especially in enterprise applications with complex workflows.

The Problem: When React Components Become Spaghetti

Let’s be honest. We’ve all been there. You start building a feature—say, a multi-step checkout form. Initially, it’s simple. A few inputs, a submit button.

But then requirements grow:

  • “We need to validate the address against a third-party API.”
  • “If the user is a returning customer, pre-fetch their saved payment methods.”
  • “Apply discount codes, but only after shipping is calculated.”
  • “If payment fails, show a specific error and roll back the shipping selection.”

Suddenly, your component looks like this:

const Checkout = () => {
  const [step, setStep] = useState(1);
  const [cart, setCart] = useState(null);
  const [shipping, setShipping] = useState(null);
  const [payment, setPayment] = useState(null);
  const [discount, setDiscount] = useState(null);
  const [errors, setErrors] = useState({});
  const [loading, setLoading] = useState(false);

  useEffect(() => {
    // Fetch cart on mount
  }, []);

  useEffect(() => {
    // Recalculate shipping when address changes
  }, [address]);

  useEffect(() => {
    // Apply discount when cart or code changes
  }, [discountCode, cart]);

  const handlePayment = async () => {
    // Complex logic with multiple steps and error handling
  };

  // 300+ more lines of imperative, hard-to-follow code...
};

This is imperative spaghetti. The “what” (user wants to checkout) is buried in the “how” (fetch this, update that, call this API, show this error). It’s hard to test, hard to debug, and even harder for new team members to understand.

Enter the Orchestration Pattern.

What Is the Orchestration Pattern in React?

Inspired by backend microservices architecture (where an orchestrator coordinates multiple services), the Orchestration Pattern in React applies the same principle: centralize complex workflow logic into a single coordinator.

Think of it like a movie director:

  • Orchestrator (Director): Knows the script. Calls “Action!” to the camera team, tells the actor when to enter, signals the lighting crew.
  • Components/APIs (Actors/Crew): Do one thing well. They don’t know the full script—they just respond to commands.

In React terms, the orchestrator manages:

  • The sequence of operations (API call A, then B, then C)
  • Branching logic (if response X, do Y; else do Z)
  • Error handling and compensation (if step 3 fails, roll back step 2)
  • State transitions (loading → success → error)
  • Side effect coordination (avoiding race conditions)

A Simple Orchestration Pattern Implementation

Let’s refactor the checkout example using a custom hook as our orchestrator.

Step 1: Define the Orchestrator Hook

// hooks/useCheckoutOrchestrator.js
import { useReducer, useCallback } from 'react';
import { validateAddress } from '../api/address';
import { calculateShipping } from '../api/shipping';
import { applyDiscount } from '../api/discount';
import { processPayment } from '../api/payment';

// State machine for the checkout process
const initialState = {
  status: 'idle', // idle, validating, calculating, paying, success, error
  step: 1,
  cart: null,
  address: null,
  shipping: null,
  discount: null,
  paymentResult: null,
  error: null,
};

function checkoutReducer(state, action) {
  switch (action.type) {
    case 'SET_CART':
      return { ...state, cart: action.payload };
    case 'SET_ADDRESS':
      return { ...state, address: action.payload };
    case 'VALIDATION_START':
      return { ...state, status: 'validating', error: null };
    case 'VALIDATION_SUCCESS':
      return { ...state, status: 'idle', step: 2 };
    case 'VALIDATION_ERROR':
      return { ...state, status: 'error', error: action.payload };
    case 'SHIPPING_START':
      return { ...state, status: 'calculating' };
    case 'SHIPPING_SUCCESS':
      return { ...state, status: 'idle', shipping: action.payload, step: 3 };
    case 'PAYMENT_START':
      return { ...state, status: 'paying' };
    case 'PAYMENT_SUCCESS':
      return { ...state, status: 'success', paymentResult: action.payload, step: 4 };
    case 'PAYMENT_ERROR':
      return { ...state, status: 'error', error: action.payload };
    case 'SET_DISCOUNT':
      return { ...state, discount: action.payload };
    case 'SET_ERROR':
      return { ...state, status: 'error', error: action.payload };
    case 'RESET':
      return initialState;
    default:
      return state;
  }
}

export function useCheckoutOrchestrator() {
  const [state, dispatch] = useReducer(checkoutReducer, initialState);

  const setCart = useCallback((cart) => {
    dispatch({ type: 'SET_CART', payload: cart });
  }, []);

  const setAddress = useCallback((address) => {
    dispatch({ type: 'SET_ADDRESS', payload: address });
  }, []);

  // The orchestrator's main workflow
  const validateAndProceed = useCallback(async (address) => {
    dispatch({ type: 'VALIDATION_START' });

    try {
      // Step 1: Validate address
      const isValid = await validateAddress(address);
      if (!isValid) {
        throw new Error('Invalid address format');
      }
      dispatch({ type: 'VALIDATION_SUCCESS' });

      // Step 2: Calculate shipping based on validated address
      dispatch({ type: 'SHIPPING_START' });
      const shippingOptions = await calculateShipping(address, state.cart);
      dispatch({ type: 'SHIPPING_SUCCESS', payload: shippingOptions });

    } catch (error) {
      dispatch({ type: 'VALIDATION_ERROR', payload: error.message });
    }
  }, [state.cart]);

  const applyDiscountCode = useCallback(async (code) => {
    if (!state.cart) return;

    try {
      const discount = await applyDiscount(code, state.cart);
      dispatch({ type: 'SET_DISCOUNT', payload: discount });
      // Recalculate shipping with discount applied
      dispatch({ type: 'SHIPPING_START' });
      const updatedShipping = await calculateShipping(state.address, state.cart, discount);
      dispatch({ type: 'SHIPPING_SUCCESS', payload: updatedShipping });
    } catch (error) {
      dispatch({ type: 'SET_ERROR', payload: error.message });
    }
  }, [state.cart, state.address]);

  const processPaymentAndComplete = useCallback(async (paymentDetails) => {
    dispatch({ type: 'PAYMENT_START' });

    try {
      const result = await processPayment({
        cart: state.cart,
        shipping: state.shipping,
        discount: state.discount,
        paymentDetails,
      });

      dispatch({ type: 'PAYMENT_SUCCESS', payload: result });

      // Optional: Navigate to success page
      return result;

    } catch (error) {
      dispatch({ type: 'PAYMENT_ERROR', payload: error.message });

      // Compensation logic: if payment fails, shipping selection remains
      // but we might want to show a retry option
      throw error;
    }
  }, [state.cart, state.shipping, state.discount]);

  const reset = useCallback(() => {
    dispatch({ type: 'RESET' });
  }, []);

  return {
    // State
    status: state.status,
    step: state.step,
    cart: state.cart,
    shipping: state.shipping,
    discount: state.discount,
    error: state.error,
    paymentResult: state.paymentResult,

    // Actions (the public API of our orchestrator)
    setCart,
    setAddress,
    validateAndProceed,
    applyDiscountCode,
    processPaymentAndComplete,
    reset,
  };
}

Step 2: Consume the Orchestrator in Components

Now your components become “dumb” presentational components that simply call the orchestrator’s methods:

// CheckoutPage.jsx
import { useCheckoutOrchestrator } from '../hooks/useCheckoutOrchestrator';
import { AddressForm } from './AddressForm';
import { ShippingSelector } from './ShippingSelector';
import { PaymentForm } from './PaymentForm';
import { LoadingSpinner } from './LoadingSpinner';
import { ErrorAlert } from './ErrorAlert';
import { OrderConfirmation } from './OrderConfirmation';
import { DiscountInput } from './DiscountInput';

export const CheckoutPage = () => {
  const {
    status,
    step,
    cart,
    shipping,
    discount,
    error,
    paymentResult,
    validateAndProceed,
    applyDiscountCode,
    processPaymentAndComplete,
    reset,
  } = useCheckoutOrchestrator();

  // Components don't need to know the complex flow!
  // They just call the orchestrator's methods.

  const handleAddressSubmit = async (addressData) => {
    await validateAndProceed(addressData);
  };

  const handleDiscountApply = async (code) => {
    await applyDiscountCode(code);
  };

  const handlePaymentSubmit = async (paymentDetails) => {
    try {
      await processPaymentAndComplete(paymentDetails);
      // Navigation happens automatically in the orchestrator
    } catch (err) {
      // Error is already in state, but we can show a toast if needed
    }
  };

  if (status === 'success') {
    return <OrderConfirmation order={paymentResult} onNewOrder={reset} />;
  }

  return (
    <div className="checkout">
      {error && <ErrorAlert message={error} onDismiss={() => reset()} />}

      {status === 'validating' || status === 'calculating' || status === 'paying' ? (
        <LoadingSpinner message="Processing your order..." />
      ) : (
        <>
          {step === 1 && (
            <AddressForm onSubmit={handleAddressSubmit} />
          )}

          {step === 2 && (
            <>
              <DiscountInput onApply={handleDiscountApply} />
              <ShippingSelector
                options={shipping}
                onSelect={handleShippingSelect}
              />
              {/* handleShippingSelect and goToStep are assumed helpers, elided
                  for brevity; a real orchestrator would expose a step-advance
                  action alongside the others */}
              <button onClick={() => goToStep(3)}>Continue to Payment</button>
            </>
          )}

          {step === 3 && (
            <PaymentForm 
              total={calculateTotal(cart, shipping, discount)}
              onSubmit={handlePaymentSubmit}
            />
          )}
        </>
      )}
    </div>
  );
};

Benefits of the Orchestration Pattern

1. Separation of Concerns

  • Components focus on presentation and user interactions
  • Orchestrator handles the “how” and “when”
  • API layers handle raw data fetching

2. Testability

Test the orchestrator in isolation without rendering UI:

import { renderHook, act } from '@testing-library/react';
import * as api from '../api/address';
import { useCheckoutOrchestrator } from '../hooks/useCheckoutOrchestrator';

test('checkout flow handles validation failure', async () => {
  const { result } = renderHook(() => useCheckoutOrchestrator());

  // Mock API to fail
  jest.spyOn(api, 'validateAddress').mockRejectedValue(new Error('Invalid'));

  await act(async () => {
    await result.current.validateAndProceed({ street: '123 Main' });
  });

  expect(result.current.status).toBe('error');
  expect(result.current.error).toBe('Invalid');
  expect(result.current.step).toBe(1); // Still on address step
});

3. Reusability

The same orchestrator can be used across different UI implementations:

  • Mobile checkout screen
  • Desktop checkout modal
  • Admin panel order creation

4. Observability

Centralized logic makes it easy to add logging, analytics, or error tracking:

const validateAndProceed = useCallback(async (address) => {
  analytics.trackEvent('checkout_address_validation_started');

  try {
    // ... validation logic
    analytics.trackEvent('checkout_address_validation_success');
  } catch (error) {
    analytics.trackEvent('checkout_address_validation_failed', { error });
    Sentry.captureException(error);
  }
}, []);

When to Use Orchestration (and When Not To)

Great Use Cases:

  • Multi-step forms (checkout, onboarding, surveys)
  • Wizard-style workflows (report generation, deployment pipelines)
  • Features with complex dependencies (dashboard with sequential data fetches)
  • Operations requiring rollback/compensation (bank transfers, reservations)

Overkill For:

  • Simple CRUD forms with one API call
  • Independent, isolated components with no coordination needs
  • Small applications where complexity doesn’t justify abstraction

Best Practices

  1. Keep orchestrators stateless where possible — store state in React state or a state machine, not in the orchestrator instance itself.

  2. Use TypeScript — define clear interfaces for your orchestrator’s context and events:

interface CheckoutContext {
  cart: Cart | null;
  address: Address | null;
  shipping: ShippingOption[] | null;
}

type CheckoutEvent = 
  | { type: 'VALIDATE_ADDRESS'; address: Address }
  | { type: 'SELECT_SHIPPING'; method: ShippingMethod }
  | { type: 'PROCESS_PAYMENT'; details: PaymentDetails };

  3. Single responsibility — an orchestrator should coordinate ONE business process. Don’t create a “god orchestrator” that handles checkout, profile updates, and notifications all in one place.

  4. Compose orchestrators — for complex apps, create smaller orchestrators that work together:

function useOrderOrchestrator() {
  const cart = useCartOrchestrator();
  const checkout = useCheckoutOrchestrator();
  const payment = usePaymentOrchestrator();

  // Compose them into a higher-level workflow
}

Conclusion

The Orchestration Pattern transforms React applications from a collection of scattered useEffect hooks and imperative logic into a clean, declarative system. By centralizing workflow coordination, you get:

  • Components that are simple and focused on presentation
  • Orchestrators that clearly express business logic
  • Code that’s easier to test, debug, and maintain

Whether you implement it with custom hooks, XState, or a full workflow engine, the principle remains the same: coordinate complexity in one place, not everywhere.

Have you used the Orchestration Pattern in your React apps? What challenges did you face? Let me know in the comments! 👇

Rebooting a Production VM on Oracle Cloud: A Reference Guide

Commands, explanations, and real output — for engineers who want to understand what’s actually happening, not just copy-paste their way through it.

☁️ Pre-Flight Checklist

Before we taxi down the runway, here’s your flight plan. Keep this handy to navigate your flight path. Welcome aboard the cloud! ☁️

🌥️ Takeoff

  • Prerequisites

⛅️ Cruising Altitude

  • Part 1 — Pre-Reboot Checklist
  • Part 2 — Running the Reboot
  • Part 3 — Post-Reboot Verification
  • Part 4 — Measuring Time to Recovery (TTR)

🌤️ Landing & Taxi

  • Quick Reference: All Commands
  • Troubleshooting Reference

Enjoy your flight! ☁️

There’s a specific kind of anxiety that comes with running sudo reboot on a server with real users on it. You know the system should come back, but “should” feels a lot less reassuring at the moment your SSH session freezes. This guide removes the guesswork. It covers everything from reading your apt upgrade output intelligently, to verifying your stack is healthy after the reboot, to measuring your actual recovery time with real commands and real numbers so that the next time you need to do this, it’s a procedure, not a gamble.

Prerequisites

This guide assumes:

  • Ubuntu 22.04 on an OCI Compute instance (ARM or x86)
  • Docker + Docker Compose managing your services
  • All long-running services configured with restart: always in your docker-compose.yml
  • SSH access to the instance

If restart: always isn’t set on your services, your containers will not come back after a reboot (restart: unless-stopped also survives reboots, provided the container wasn’t manually stopped). Check this first.

services:
  backend:
    image: your-backend-image
    restart: always  # ✅ restarts automatically after reboot or crash

  migrations:
    image: your-migrations-image
    # no restart policy  # ✅ correct — this should run once and exit

restart: always tells Docker to relaunch the container whenever it stops — whether from a crash or a full system reboot. The one exception to be deliberate about is one-shot containers like database migrations: they’re designed to run once and exit cleanly, so no restart policy is the right call for those.
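
To audit the restart policy of every running container in one pass:

docker ps -q | xargs docker inspect \
  --format '{{.Name}}: {{.HostConfig.RestartPolicy.Name}}'

Anything that prints no (the default) will stay down after the reboot unless it is a deliberate one-shot container.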

Part 1 — Pre-Reboot Checklist

Never reboot without completing this checklist. It takes under two minutes and prevents the most common post-reboot problems.

1.1 Verify no critical process is mid-flight

docker ps

What to look for:

STATUS                    Meaning
Up 2 days (healthy)       Safe to reboot
Up 3 minutes              Something recently restarted — investigate
Restarting (1)            Container is crash-looping — fix before rebooting
Up 2 hours (unhealthy)    Health check is failing — fix before rebooting

If everything shows Up [days/weeks] (healthy), you are clear.

Why this matters: If a database migration container is mid-run, or a background job is processing a large task, a reboot will kill it mid-execution. You want to reboot during a quiet moment.

1.2 Validate your Compose configuration

cd ~/your-project
docker compose config

Expected output: Your full resolved docker-compose.yml printed to the terminal, with no errors.

Why this matters: docker compose config resolves all environment variables and validates YAML syntax. If there’s a broken variable reference or a typo in your file, this command catches it now — not after the reboot when containers silently fail to start. A common mistake is editing a .env file or docker-compose.yml and not realising you’ve introduced a syntax error. This is your safety net.

1.3 Read your apt upgrade output

When you run sudo apt update && sudo apt upgrade -y before a reboot, the output tells you exactly what changed on your system. Don’t skip past it.

Here’s a real upgrade output and what each part means:

The following packages will be upgraded:
  containerd.io coreutils docker-ce docker-ce-cli
  docker-ce-rootless-extras docker-compose-plugin docker-model-plugin
  gitlab-runner gitlab-runner-helper-images libnftables1 nftables
  python3-pyasn1

How to read this list:

Package                                     What it is                                   Reboot needed?
docker-ce, containerd.io, docker-ce-cli     The Docker engine and its runtime            Recommended
docker-compose-plugin                       The docker compose CLI plugin                No
nftables, libnftables1                      Linux kernel firewall/networking             Yes
coreutils                                   Fundamental Linux utilities (ls, cp, etc.)   Recommended
gitlab-runner, gitlab-runner-helper-images  CI/CD runner agent                           Service restarts during upgrade
python3-pyasn1                              Python crypto library                        No

The rule of thumb: If the upgrade touches anything in the kernel, networking stack, or container runtime — reboot. If it’s only application-level packages — a reboot is optional but never harmful.
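
Ubuntu also records the decision for you: when an upgrade requires a reboot, it writes a marker file you can check directly.

# Present only when a reboot is pending
[ -f /var/run/reboot-required ] && cat /var/run/reboot-required

# Which packages triggered it (if the file exists)
cat /var/run/reboot-required.pkgs 2>/dev/null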

1.4 Understand the service restart messages

After apt upgrade, Ubuntu’s needrestart tool prints which services were restarted automatically and which were deferred:

Restarting services...
 systemctl restart irqbalance.service ssh.service rsyslog.service ...

Service restarts being deferred:
 systemctl restart networkd-dispatcher.service
 systemctl restart systemd-logind.service

“Restarting services” — These were restarted immediately. Your SSH connection stayed alive because ssh.service restarts in-place without dropping existing sessions.

“Service restarts being deferred” — These require a full reboot to apply safely. systemd-logind manages user sessions; restarting it mid-session can cause issues, so Ubuntu defers it to the next clean boot.

No containers need to be restarted.

This line, also from needrestart, means no running container was flagged as using outdated binaries or libraries. This is expected if you haven’t rebuilt your application images.

1.5 Check available disk space

df -h /

Example output:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        48G   12G   36G  23% /

You want at least 20% free on your root partition. Docker image pulls and accumulated log files are the two most common causes of a full disk, which can prevent containers from starting after a reboot.

Tip: A maintenance window is also a good moment to reclaim space by pruning unused Docker build cache layers (the commands below show how). In a real maintenance run, pruning printed:

Total reclaimed space: 4.165GB
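
If you need to reclaim space manually, these two commands show where Docker’s usage sits and prune only the build cache, which is safe while services are running:

docker system df                 # usage breakdown: images, containers, volumes, build cache
docker builder prune --force     # remove unused build cache layers only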

Part 2 — Running the Reboot

Once the checklist is complete:

sudo reboot

What happens next, step by step:

  1. The OS sends SIGTERM to all running processes, giving them time to shut down cleanly.
  2. Docker receives the signal and stops all containers gracefully.
  3. The kernel shuts down and the VM restarts.
  4. Your SSH session prints Connection to [ip] closed by remote host. and terminates. This is normal.

How long to wait: OCI ARM instances (Ampere A1) typically reboot in 45–90 seconds. Wait at least 60 seconds before trying to reconnect.

ssh -i ~/.ssh/id_rsa ubuntu@YOUR_IP
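
If you’d rather not guess at the timing, poll until SSH answers and then connect:

# Retry every 5 seconds until the VM accepts connections
until ssh -i ~/.ssh/id_rsa -o ConnectTimeout=5 ubuntu@YOUR_IP true; do
  sleep 5
done
ssh -i ~/.ssh/id_rsa ubuntu@YOUR_IP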

Part 3 — Post-Reboot Verification

Run these checks in order. Each one builds on the last.

3.1 Check the Docker daemon

sudo systemctl status docker

Expected output:

 docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled)
     Active: active (running) since Mon 2026-03-30 15:55:51 UTC; 5min ago

Key things to check:

  • Active: active (running) — the daemon is running ✅
  • enabled — it is configured to auto-start on every future boot ✅

If the daemon isn’t running:

sudo systemctl enable docker   # ensure it starts on future reboots
sudo systemctl start docker    # start it now

3.2 Check all containers are up

docker ps

Example output:

CONTAINER ID   IMAGE              COMMAND        CREATED      STATUS                  PORTS    NAMES
fc46f84c7bd5   app-backend        "uv run uvi…"  2 days ago   Up 5 minutes (healthy)  8000/tcp app_backend
a3e9a2eeb160   redis:alpine       "docker-ent…"  2 weeks ago  Up 5 minutes (healthy)  6379/tcp app_redis
f4afe2edb00c   caddy:alpine       "caddy run …"  4 weeks ago  Up 5 minutes (healthy)  80, 443  caddy_proxy

What to check:

  • Every service you expect should be present. If one is missing, it crashed on startup.
  • STATUS should be Up or Up (healthy). (health: starting) is fine for the first 30 seconds after boot.
  • The CREATED timestamp does not reset on reboot — it reflects when the container was first created with docker compose up. This is normal.

If a container is missing or in a restart loop:

docker compose logs [service_name] --tail=50

This shows the last 50 log lines for that specific service, which will usually tell you exactly why it failed.

3.3 Watch the live logs

cd ~/your-project
docker compose logs -f --tail=20

The -f flag follows the log stream in real time. --tail=20 shows the last 20 lines per service as a starting point.

What healthy output looks like:

app_gate    | 127.0.0.1 - - [30/Mar/2026:16:00:00 +0000] "GET / HTTP/1.1" 200 4140
app_backend | INFO: 127.0.0.1:58562 - "GET /health HTTP/1.1" 200 OK
caddy_proxy | {"level":"info","msg":"received request","uri":"/config/"}
app_redis   | * Ready to accept connections tcp

What a transient (non-critical) error looks like:

app_worker | redis.exceptions.ConnectionError: Error while reading
            from redis:6379 : (104, 'Connection reset by peer')
app_worker | 15:56:15: Starting worker for 1 functions: process_message
app_worker | 15:56:15: redis_version=8.6.1 mem_usage=1.38M clients_connected=1

This pattern — an error followed immediately by a successful connection message — is normal during cold starts. When all containers launch simultaneously, a dependent service (like a worker) may attempt its first connection before its dependency (like Redis) has finished initialising. The container retries and connects successfully on the next attempt. This is expected behavior.
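
You can eliminate most of these cold-start races at the Compose level. A sketch, assuming the dependency defines a healthcheck — service and image names are illustrative:

services:
  worker:
    image: your-worker-image
    restart: always
    depends_on:
      redis:
        condition: service_healthy  # wait until Redis passes its healthcheck

  redis:
    image: redis:alpine
    restart: always
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      retries: 5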

What a critical error looks like:

app_backend | sqlalchemy.exc.OperationalError: connection refused
app_backend | [after 5 retries] giving up

A critical error is one that does not resolve on its own. If you see continuous errors without a recovery line following them, press Ctrl+C and investigate that service.

3.4 Check additional system services

If you run a CI/CD runner or similar agent alongside Docker:

sudo gitlab-runner status

Expected output:

gitlab-runner: Service is running

If it’s not running:

sudo gitlab-runner start

Part 4 — Measuring Time to Recovery (TTR)

TTR is the total time from sudo reboot to the moment your application is serving healthy responses. Measuring it gives you accurate data for maintenance window planning and user communications.

4.1 Measure OS boot time

systemd-analyze

Example output:

Startup finished in 3.617s (kernel) + 19.608s (userspace) = 23.225s
graphical.target reached after 18.845s in userspace

Breaking this down:

Phase      Time    What’s happening
Kernel     3.6s    The Linux kernel loads into memory and initialises hardware drivers
Userspace  19.6s   All systemd services start in parallel (networking, Docker, SSH, etc.)
Total      23.2s   OS is fully booted

4.2 Find the bottleneck in the boot sequence

systemd-analyze blame | head -20

This lists every service sorted by how long it took to start, slowest first:

12.186s docker.service
 4.821s cloud-init.service
 1.204s snapd.service
   38ms docker.socket

In this case, Docker itself accounted for 12 of the 23 total seconds. This is normal — Docker has to read its state from disk, re-attach networks, and prepare to launch containers.

Why this is useful: If your boot time is unexpectedly long, systemd-analyze blame tells you exactly which service is the bottleneck.

4.3 Find the exact moment containers started

docker inspect --format='{{.Name}}: {{.State.StartedAt}}' $(docker ps -q)

Example output:

/app_ftp_bridge:  2026-03-30T15:55:57.766Z
/app_worker:      2026-03-30T15:55:57.695Z
/app_backend:     2026-03-30T15:55:57.646Z
/app_gate:        2026-03-30T15:55:57.830Z
/app_admin:       2026-03-30T15:55:57.742Z
/app_redis:       2026-03-30T15:55:57.794Z
/caddy_proxy:     2026-03-30T15:55:57.615Z

Every container launched within the same second. This is because Docker starts all containers in parallel as soon as the daemon is ready. Note: this timestamp reflects when Docker launched the container process, not when the application inside it was ready to serve traffic. A container may take a further 5–30 seconds to pass its health check after this point.
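
To see readiness rather than launch time, inspect the health state directly (the template guards against containers that define no healthcheck):

docker inspect \
  --format '{{.Name}}: {{if .State.Health}}{{.State.Health.Status}}{{else}}no healthcheck{{end}}' \
  $(docker ps -q)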

4.4 Build your full TTR timeline

Combining the data from the above commands:

Event                                 Time (relative to reboot)
sudo reboot executed                  T+0s
SSH connection closed                 T+~5s
Kernel boot complete                  T+~8s
Userspace boot complete (OS ready)    T+~28s
Docker daemon ready                   T+~28s (12s of the userspace phase)
All containers launched               T+~28s
Redis accepting connections           T+~30s
Backend /health returning 200         T+~35s
All health checks passing             T+~55s
Total TTR                             ~55–60 seconds

4.5 Use TTR to plan user communications

With a measured TTR, you can set honest expectations.

Internal / engineering team:

“Maintenance reboot at [time]. Expected downtime: ~2 minutes.”

The 2-minute internal window gives a buffer above the measured ~60 seconds for anything unexpected.

External users:

“Scheduled maintenance in progress. Services will be restored within 5 minutes.”

The 5-minute external window is deliberately conservative. If a container fails its first health check and requires a full restart cycle (up to 5 retries × 5 seconds = 25 extra seconds), you’re still within your stated window. Under-promise, over-deliver.

Quick Reference: All Commands

# --- PRE-REBOOT ---
docker ps                              # check container states
docker compose config                  # validate compose file syntax
df -h /                                # check available disk space

# --- REBOOT ---
sudo reboot                            # initiate the reboot

# --- POST-REBOOT ---
sudo systemctl status docker           # confirm daemon is running
docker ps                              # confirm containers are up
docker compose logs -f --tail=20       # watch live logs
sudo gitlab-runner status              # check runner (if applicable)

# --- TTR MEASUREMENT ---
systemd-analyze                        # total OS boot time
systemd-analyze blame | head -20       # per-service boot time breakdown
docker inspect --format='{{.Name}}: {{.State.StartedAt}}' $(docker ps -q)
                                       # exact container start timestamps

Troubleshooting Reference

Symptom                                                 Likely cause                               Fix
Container missing from docker ps                        Crashed on startup                         docker compose logs [service] --tail=50
Stuck in (health: starting) after 2+ minutes            Health check command failing               docker inspect [id] → check Health.Log
Docker daemon not running                               Not enabled in systemd                     sudo systemctl enable docker && sudo systemctl start docker
SSH times out for more than 3 minutes                   VM didn’t boot cleanly                     Check OCI console → instance serial console for kernel panic output
All containers up but app unreachable externally        Reverse proxy (Caddy/Nginx) issue          docker compose logs caddy --tail=50
Persistent container errors after cold start            Service started before its dependency      Wait 60 seconds, then re-check — most resolve automatically


GenAIOps on AWS: End-to-End Observability Stack – Part 3

Reading time: ~22-25 minutes

Level: Intermediate to Advanced

Series: Part 3 of 4 – End-to-End Observability

What you’ll learn: Build comprehensive observability for GenAI systems with CloudWatch GenAI Observability, X-Ray distributed tracing, and custom metrics

The Problem: When GenAI Goes Wrong at 3 AM

It’s 3 AM. PagerDuty wakes you up.

You open your logs. 10,000 lines of JSON. Where do you start?

Everything returns 200. But users are complaining. What’s actually failing?

  • Is retrieval slow? Can’t tell from these logs
  • Is the LLM hallucinating? No quality metrics captured
  • Why is cost 5x higher? Token counts missing
  • Which model is being used? Not tracked
  • What context was retrieved? Lost in the void

Traditional observability wasn’t built for this. You need GenAI-specific observability that captures the full story: retrieval quality, token consumption, model behavior, and end-to-end traces showing exactly where things break.

This is what we’re building today.

The GenAI Observability Challenge

GenAI systems are fundamentally different from traditional microservices:

Traditional Microservice Request

Request → service → database → response. One code path, and the outcome is a binary success or failure.

GenAI System Request

Query → embedding → vector retrieval → reranking → prompt assembly → LLM generation → response. Many stages, each with its own quality, cost, and latency characteristics.

The challenge: A request can succeed (200 OK) but still fail the user:

  • Retrieved wrong documents → bad answer
  • LLM hallucinated → user misinformed
  • Cost spiked 5x → budget blown
  • Latency is 8s → user abandoned request

Traditional observability captures success/failure. GenAI observability captures quality/cost/performance at every step.

AWS CloudWatch GenAI Observability

AWS launched CloudWatch GenAI Observability in preview (Q4 2024) and GA (October 2025). It’s purpose-built for LLM applications.

What It Provides Out-of-the-Box

1. Model Invocation Dashboard

Automatic tracking of:

  • Invocation metrics: Count, success rate, throttles
  • Token metrics: Input tokens, output tokens, total tokens
  • Cost attribution: Per-model, per-request costs
  • Latency breakdown: Time-to-first-token, generation latency
  • Error tracking: Model errors, throttling, timeouts

2. AgentCore Agent Dashboard

For Amazon Bedrock AgentCore agents:

  • Session tracking: Duration, turn count, completion
  • Tool usage: Which tools called, frequency, success rate
  • Memory operations: Reads, writes, retrieval performance
  • Gateway metrics: API latency, auth failures
  • Reasoning traces: Step-by-step agent decision logs

3. OpenTelemetry Integration

  • Distributed tracing: End-to-end request flows
  • Custom spans: Instrument your components
  • Automatic instrumentation: AWS SDK calls auto-traced
  • X-Ray integration: Service maps and bottleneck detection

Architecture: Complete Observability Stack

Setting Up OpenTelemetry with ADOT

AWS Distro for OpenTelemetry (ADOT) is AWS’s distribution of OpenTelemetry, pre-configured for AWS services.

Installation
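
The original snippet is missing here; a minimal sketch of the Python setup. The package names follow the ADOT Python distribution, but check the current ADOT docs for your runtime:

# ADOT's OpenTelemetry distribution for Python, plus the OTLP exporter
pip install aws-opentelemetry-distro opentelemetry-exporter-otlp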

Basic Configuration
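
The original configuration is missing here; a minimal collector sketch under assumed defaults — OTLP in, X-Ray traces and CloudWatch EMF metrics out. The namespace and endpoint are illustrative:

# adot-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  awsxray:
  awsemf:
    namespace: GenAI/RAG

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
    metrics:
      receivers: [otlp]
      exporters: [awsemf]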

Auto-Instrumentation Setup
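
The original snippet is missing here; a sketch of how ADOT auto-instrumentation is typically launched. The environment variable values follow ADOT's documented conventions, but treat them as assumptions and verify against the current docs:

# Select the AWS distro/configurator and point at the local collector
export OTEL_PYTHON_DISTRO=aws_distro
export OTEL_PYTHON_CONFIGURATOR=aws_configurator
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_RESOURCE_ATTRIBUTES=service.name=rag-api

# Wraps the app; boto3/botocore calls are traced automatically
opentelemetry-instrument python app.py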

Instrumenting Your RAG Application

Now let’s instrument a complete RAG pipeline:

# instrumented_rag_system.py
import boto3
import json
from typing import List, Dict
from opentelemetry import trace
from datetime import datetime

class InstrumentedRAGSystem:
    """
    Fully instrumented RAG system with distributed tracing

    Captures:
    - End-to-end request traces
    - Per-component latency
    - Token consumption and costs
    - Quality signals
    - Error details
    """

    def __init__(self):
        self.bedrock_runtime = boto3.client('bedrock-runtime')
        self.opensearch = boto3.client('opensearchserverless')  # control-plane client; the search call itself is mocked below
        self.cloudwatch = boto3.client('cloudwatch')

        # Get tracer
        self.tracer = trace.get_tracer(__name__)

        # Model pricing (per 1K tokens)
        self.pricing = {
            "anthropic.claude-sonnet-4-20250514": {
                "input": 0.003,
                "output": 0.015
            },
            "amazon.titan-embed-text-v2:0": {
                "input": 0.0001,
                "output": 0
            }
        }

    def query(self, user_query: str, user_id: str = None) -> Dict:
        """
        Process RAG query with full instrumentation

        Args:
            user_query: User's question
            user_id: Optional user identifier for tracking

        Returns:
            Dict with answer and metadata
        """

        # Start root span
        with self.tracer.start_as_current_span("rag_query") as root_span:

            # Add request attributes
            root_span.set_attribute("query", user_query)
            root_span.set_attribute("query_length", len(user_query))
            if user_id:
                root_span.set_attribute("user_id", user_id)
            root_span.set_attribute("timestamp", datetime.now().isoformat())

            try:
                # Step 1: Generate embeddings
                with self.tracer.start_as_current_span("generate_embeddings") as span:
                    embeddings, embed_cost = self._generate_embeddings(user_query)

                    span.set_attribute("embedding_dimension", len(embeddings))
                    span.set_attribute("embedding_cost_usd", embed_cost)
                    span.set_attribute("model", "amazon.titan-embed-text-v2:0")

                # Step 2: Vector search
                with self.tracer.start_as_current_span("vector_search") as span:
                    contexts = self._vector_search(embeddings, top_k=5)

                    span.set_attribute("documents_retrieved", len(contexts))
                    if contexts:
                        avg_score = sum(c['score'] for c in contexts) / len(contexts)
                        span.set_attribute("avg_similarity_score", round(avg_score, 3))
                        span.set_attribute("top_score", round(contexts[0]['score'], 3))

                    # Publish retrieval quality metric
                    self._publish_metric(
                        "RetrievalQuality",
                        avg_score if contexts else 0,
                        namespace="GenAI/RAG/Retrieval"
                    )

                # Step 3: Rerank (optional but recommended)
                with self.tracer.start_as_current_span("rerank_documents") as span:
                    contexts = self._rerank_contexts(user_query, contexts, top_k=3)

                    span.set_attribute("documents_after_rerank", len(contexts))
                    if contexts:
                        span.set_attribute("top_rerank_score", round(contexts[0]['rerank_score'], 3))

                # Step 4: Build prompt and count tokens
                with self.tracer.start_as_current_span("prompt_construction") as span:
                    prompt = self._build_prompt(user_query, contexts)
                    input_tokens = self._estimate_tokens(prompt)

                    span.set_attribute("input_tokens", input_tokens)
                    span.set_attribute("context_documents", len(contexts))
                    span.set_attribute("prompt_length_chars", len(prompt))

                    # Check context window
                    max_context_window = 200000  # Claude Sonnet 4
                    if input_tokens > max_context_window:
                        span.set_attribute("error", "context_window_exceeded")
                        raise ValueError(f"Input tokens ({input_tokens}) exceed context window")

                # Step 5: Generate response
                with self.tracer.start_as_current_span("llm_generation") as span:
                    response = self._generate_response(prompt)

                    # Extract metrics
                    usage = response.get('usage', {})
                    input_tokens = usage.get('input_tokens', 0)
                    output_tokens = usage.get('output_tokens', 0)
                    model_id = "anthropic.claude-sonnet-4-20250514"

                    # Calculate cost
                    cost = self._calculate_cost(
                        model_id=model_id,
                        input_tokens=input_tokens,
                        output_tokens=output_tokens
                    )

                    # Add span attributes
                    span.set_attribute("model_id", model_id)
                    span.set_attribute("input_tokens", input_tokens)
                    span.set_attribute("output_tokens", output_tokens)
                    span.set_attribute("total_tokens", input_tokens + output_tokens)
                    span.set_attribute("generation_cost_usd", cost)
                    span.set_attribute("stop_reason", response.get('stop_reason', 'unknown'))

                    # Publish token metrics
                    self._publish_metric("InputTokens", input_tokens, namespace="GenAI/Tokens")
                    self._publish_metric("OutputTokens", output_tokens, namespace="GenAI/Tokens")
                    self._publish_metric("GenerationCost", cost, namespace="GenAI/Cost")

                # Step 6: Extract answer
                answer = response['content'][0]['text']

                # Add overall metrics to root span
                total_cost = embed_cost + cost
                root_span.set_attribute("total_cost_usd", round(total_cost, 4))
                root_span.set_attribute("total_tokens", input_tokens + output_tokens)
                root_span.set_attribute("answer_length", len(answer))
                root_span.set_attribute("status", "success")

                # Publish overall metrics
                self._publish_metric("RequestCost", total_cost, namespace="GenAI/Cost")
                self._publish_metric("RequestSuccess", 1, namespace="GenAI/Quality")

                return {
                    "answer": answer,
                    "metadata": {
                        "input_tokens": input_tokens,
                        "output_tokens": output_tokens,
                        "total_cost": round(total_cost, 4),
                        "contexts_used": len(contexts),
                        "model": model_id
                    }
                }

            except Exception as e:
                # Capture error in span
                root_span.set_attribute("error", True)
                root_span.set_attribute("error_type", type(e).__name__)
                root_span.set_attribute("error_message", str(e))
                root_span.set_attribute("status", "error")

                # Publish error metric
                self._publish_metric("RequestErrors", 1, namespace="GenAI/Errors")

                # Re-raise
                raise

    def _generate_embeddings(self, text: str) -> tuple:
        """Generate embeddings with Bedrock Titan"""

        response = self.bedrock_runtime.invoke_model(
            modelId="amazon.titan-embed-text-v2:0",
            body=json.dumps({
                "inputText": text,
                "dimensions": 1024,
                "normalize": True
            })
        )

        result = json.loads(response['body'].read())
        embeddings = result['embedding']

        # Calculate cost
        token_count = len(text.split()) * 1.3  # Rough estimate
        cost = (token_count / 1000) * self.pricing["amazon.titan-embed-text-v2:0"]["input"]

        return embeddings, cost

    def _vector_search(self, embeddings: List[float], top_k: int = 5) -> List[Dict]:
        """
        Search OpenSearch vector index

        Note: This is automatically traced via boto3 instrumentation
        """

        # OpenSearch vector search
        # In production, use actual OpenSearch client

        # Mock response for example
        return [
            {
                "id": "doc_1",
                "score": 0.89,
                "text": "Electronics can be returned within 30 days..."
            },
            {
                "id": "doc_2",
                "score": 0.76,
                "text": "Damaged items require photo documentation..."
            },
            {
                "id": "doc_3",
                "score": 0.71,
                "text": "Restocking fees apply to opened electronics..."
            }
        ]

    def _rerank_contexts(
        self,
        query: str,
        contexts: List[Dict],
        top_k: int = 3
    ) -> List[Dict]:
        """
        Rerank contexts using cross-encoder

        In production, use:
        - Bedrock reranking model
        - Cohere rerank
        - Custom cross-encoder
        """

        # For example, just return top contexts
        # In production, apply reranking model
        for ctx in contexts[:top_k]:
            ctx['rerank_score'] = ctx['score'] * 1.1  # Mock rerank

        return contexts[:top_k]

    def _build_prompt(self, query: str, contexts: List[Dict]) -> str:
        """Build prompt from query and contexts"""

        context_text = "nn".join([
            f"Document {i+1}:n{ctx['text']}"
            for i, ctx in enumerate(contexts)
        ])

        prompt = f"""You are a helpful customer service assistant. Answer the user's question based on the provided context.

Context:
{context_text}

Question: {query}

Answer the question using only information from the context. If the context doesn't contain enough information, say so."""

        return prompt

    def _estimate_tokens(self, text: str) -> int:
        """Rough token estimation"""
        # 1 token ≈ 0.75 words for English
        return int(len(text.split()) * 1.3)

    def _generate_response(self, prompt: str) -> Dict:
        """Generate response with Bedrock"""

        response = self.bedrock_runtime.invoke_model(
            modelId="anthropic.claude-sonnet-4-20250514",
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 2048,
                "temperature": 0.7,
                "messages": [
                    {"role": "user", "content": prompt}
                ]
            })
        )

        return json.loads(response['body'].read())

    def _calculate_cost(
        self,
        model_id: str,
        input_tokens: int,
        output_tokens: int
    ) -> float:
        """Calculate request cost"""

        pricing = self.pricing.get(model_id, {"input": 0, "output": 0})

        cost = (
            (input_tokens / 1000) * pricing["input"] +
            (output_tokens / 1000) * pricing["output"]
        )

        return cost

    def _publish_metric(
        self,
        metric_name: str,
        value: float,
        namespace: str = "GenAI/Custom"
    ):
        """Publish custom metric to CloudWatch"""

        try:
            self.cloudwatch.put_metric_data(
                Namespace=namespace,
                MetricData=[
                    {
                        'MetricName': metric_name,
                        'Value': value,
                        'Unit': 'None',
                        'Timestamp': datetime.now()
                    }
                ]
            )
        except Exception as e:
            # Don't fail request if metric publishing fails
            print(f"Warning: Failed to publish metric {metric_name}: {e}")

# Usage
rag_system = InstrumentedRAGSystem()

response = rag_system.query(
    user_query="What's the return policy for damaged electronics?",
    user_id="user_12345"
)

print(f"Answer: {response['answer']}")
print(f"Cost: ${response['metadata']['total_cost']}")
print(f"Tokens: {response['metadata']['total_tokens']}")

AWS X-Ray Integration

X-Ray provides the service map and bottleneck detection that traces alone can’t give you.

Enabling X-Ray Active Tracing

Lambda Function: enable active tracing on the function itself; this can be done from the console, the CLI, or your infrastructure-as-code tooling.

Terraform Configuration: on the aws_lambda_function resource, set tracing_config { mode = "Active" }.
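
If you prefer to flip the switch from code, here is a minimal boto3 sketch. It assumes the function name used in the logging examples later in this post; substitute your own.

# enable_active_tracing.py
# Sketch: enable X-Ray active tracing on an existing Lambda function.
import boto3

lambda_client = boto3.client('lambda')

lambda_client.update_function_configuration(
    FunctionName='rag-api',  # placeholder: your function name
    TracingConfig={'Mode': 'Active'}
)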

Custom X-Ray Segments

# custom_xray_segments.py
from aws_xray_sdk.core import xray_recorder

class XRayInstrumentedRAG:
    """RAG system with custom X-Ray segments"""

    def query(self, user_query: str):
        """Process query with custom segments"""

        # Retrieval segment
        with xray_recorder.in_subsegment('retrieval') as segment:
            contexts = self._retrieve_contexts(user_query)

            # Add annotations (indexed for filtering)
            segment.put_annotation('documents_found', len(contexts))
            segment.put_annotation('avg_relevance', 
                                  sum(c['score'] for c in contexts) / len(contexts))

            # Add metadata (not indexed)
            segment.put_metadata('retrieval_method', 'vector_search')
            segment.put_metadata('top_documents', [c['id'] for c in contexts[:3]])

        # Generation segment
        with xray_recorder.in_subsegment('generation') as segment:
            response = self._generate(user_query, contexts)

            # Annotations
            segment.put_annotation('input_tokens', response['input_tokens'])
            segment.put_annotation('output_tokens', response['output_tokens'])
            segment.put_annotation('cost_usd', response['cost'])

            # Metadata
            segment.put_metadata('model_id', response['model_id'])
            segment.put_metadata('stop_reason', response['stop_reason'])

        return response

X-Ray Service Map Insights

X-Ray automatically generates service maps showing:

  • Dependencies between your API layer, Lambda functions, Bedrock, and data stores
  • Latency and error rates for each node, so slow or failing components stand out
  • Which stage of the pipeline (retrieval, reranking, generation) dominates request time

Building Comprehensive CloudWatch Dashboards

Create unified dashboards showing the full picture:

# comprehensive_dashboard.py
import boto3
import json
from typing import Dict, List

class GenAIDashboardBuilder:
    """Build comprehensive CloudWatch dashboards for GenAI systems"""

    def __init__(self):
        self.cloudwatch = boto3.client('cloudwatch')

    def create_production_dashboard(self) -> str:
        """
        Create production-grade dashboard with:
        - Quality metrics
        - Performance metrics
        - Cost tracking
        - Error monitoring
        - User satisfaction
        """

        dashboard_body = {
            "widgets": self._build_all_widgets()
        }

        response = self.cloudwatch.put_dashboard(
            DashboardName='GenAI-Production-Observability',
            DashboardBody=json.dumps(dashboard_body)
        )

        dashboard_url = (
            f"https://console.aws.amazon.com/cloudwatch/home"
            f"?region=us-east-1#dashboards:name=GenAI-Production-Observability"
        )

        print(f"✓ Dashboard created: {dashboard_url}")
        return dashboard_url

    def _build_all_widgets(self) -> List[Dict]:
        """Build all dashboard widgets"""

        widgets = []

        # Row 1: Quality Metrics (0, 0)
        widgets.append(self._quality_metrics_widget(x=0, y=0))
        widgets.append(self._quality_distribution_widget(x=12, y=0))

        # Row 2: Performance Metrics (0, 6)
        widgets.append(self._latency_breakdown_widget(x=0, y=6))
        widgets.append(self._throughput_widget(x=12, y=6))

        # Row 3: Cost & Tokens (0, 12)
        widgets.append(self._cost_metrics_widget(x=0, y=12))
        widgets.append(self._token_usage_widget(x=8, y=12))
        widgets.append(self._cost_per_user_widget(x=16, y=12))

        # Row 4: Errors & Alerts (0, 18)
        widgets.append(self._error_rate_widget(x=0, y=18))
        widgets.append(self._error_breakdown_widget(x=8, y=18))
        widgets.append(self._recent_errors_log_widget(x=16, y=18))

        # Row 5: Model Performance (0, 24)
        widgets.append(self._model_comparison_widget(x=0, y=24))
        widgets.append(self._stop_reasons_widget(x=12, y=24))

        # Row 6: User Experience (0, 30)
        widgets.append(self._user_satisfaction_widget(x=0, y=30))
        widgets.append(self._session_metrics_widget(x=12, y=30))

        # Row 7: X-Ray Service Map (0, 36)
        widgets.append(self._xray_service_map_widget(x=0, y=36))

        return widgets

    def _quality_metrics_widget(self, x: int, y: int) -> Dict:
        """Real-time quality metrics"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Quality", "Faithfulness", {
                        "stat": "Average",
                        "label": "Faithfulness"
                    }],
                    [".", "AnswerRelevancy", {
                        "stat": "Average",
                        "label": "Relevancy"
                    }],
                    [".", "ContextPrecision", {
                        "stat": "Average",
                        "label": "Context Precision"
                    }],
                    [".", "ContextRecall", {
                        "stat": "Average",
                        "label": "Context Recall"
                    }]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": "us-east-1",
                "title": "📊 RAG Quality Metrics",
                "period": 300,
                "yAxis": {
                    "left": {
                        "min": 0,
                        "max": 1,
                        "label": "Score"
                    }
                },
                "annotations": {
                    "horizontal": [
                        {
                            "value": 0.85,
                            "label": "Target",
                            "color": "#2ca02c"
                        },
                        {
                            "value": 0.75,
                            "label": "Warning",
                            "color": "#ff7f0e"
                        },
                        {
                            "value": 0.60,
                            "label": "Critical",
                            "color": "#d62728"
                        }
                    ]
                }
            }
        }

    def _quality_distribution_widget(self, x: int, y: int) -> Dict:
        """Quality score distribution"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Quality", "Faithfulness", {
                        "stat": "p50",
                        "label": "P50"
                    }],
                    ["...", {
                        "stat": "p90",
                        "label": "P90"
                    }],
                    ["...", {
                        "stat": "p99",
                        "label": "P99"
                    }]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": "us-east-1",
                "title": "📈 Faithfulness Distribution (P50/P90/P99)",
                "period": 300
            }
        }

    def _latency_breakdown_widget(self, x: int, y: int) -> Dict:
        """Latency breakdown by component"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Performance", "EmbeddingLatency", {
                        "stat": "Average",
                        "label": "Embeddings"
                    }],
                    [".", "VectorSearchLatency", {
                        "stat": "Average",
                        "label": "Vector Search"
                    }],
                    [".", "RerankLatency", {
                        "stat": "Average",
                        "label": "Reranking"
                    }],
                    [".", "GenerationLatency", {
                        "stat": "Average",
                        "label": "LLM Generation"
                    }],
                    [".", "EndToEndLatency", {
                        "stat": "Average",
                        "label": "Total",
                        "color": "#1f77b4"
                    }]
                ],
                "view": "timeSeries",
                "stacked": True,
                "region": "us-east-1",
                "title": "⚡ Latency Breakdown (Stacked)",
                "period": 300,
                "yAxis": {
                    "left": {
                        "label": "Milliseconds"
                    }
                }
            }
        }

    def _throughput_widget(self, x: int, y: int) -> Dict:
        """Request throughput"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Throughput", "RequestCount", {
                        "stat": "Sum",
                        "label": "Total Requests"
                    }],
                    [".", "SuccessfulRequests", {
                        "stat": "Sum",
                        "label": "Successful"
                    }],
                    [".", "FailedRequests", {
                        "stat": "Sum",
                        "label": "Failed"
                    }]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": "us-east-1",
                "title": "🔄 Request Throughput",
                "period": 300
            }
        }

    def _cost_metrics_widget(self, x: int, y: int) -> Dict:
        """Cost tracking"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 8,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Cost", "TotalCost", {
                        "stat": "Sum",
                        "label": "Total Cost"
                    }],
                    [".", "EmbeddingCost", {
                        "stat": "Sum",
                        "label": "Embeddings"
                    }],
                    [".", "GenerationCost", {
                        "stat": "Sum",
                        "label": "Generation"
                    }]
                ],
                "view": "timeSeries",
                "stacked": True,
                "region": "us-east-1",
                "title": "💰 Cost Breakdown (USD)",
                "period": 300,
                "yAxis": {
                    "left": {
                        "label": "USD"
                    }
                }
            }
        }

    def _token_usage_widget(self, x: int, y: int) -> Dict:
        """Token consumption"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 8,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Tokens", "InputTokens", {
                        "stat": "Sum",
                        "label": "Input Tokens"
                    }],
                    [".", "OutputTokens", {
                        "stat": "Sum",
                        "label": "Output Tokens"
                    }]
                ],
                "view": "timeSeries",
                "stacked": True,
                "region": "us-east-1",
                "title": "🎫 Token Usage",
                "period": 300
            }
        }

    def _cost_per_user_widget(self, x: int, y: int) -> Dict:
        """Cost per user/query"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 8,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Cost", "CostPerQuery", {
                        "stat": "Average",
                        "label": "Avg per Query"
                    }],
                    ["...", {
                        "stat": "p95",
                        "label": "P95 per Query"
                    }]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": "us-east-1",
                "title": "💵 Cost Per Query",
                "period": 300,
                "yAxis": {
                    "left": {
                        "label": "USD"
                    }
                }
            }
        }

    def _error_rate_widget(self, x: int, y: int) -> Dict:
        """Error rate tracking"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 8,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Errors", "ErrorRate", {
                        "stat": "Average",
                        "label": "Error Rate %"
                    }]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": "us-east-1",
                "title": "❌ Error Rate",
                "period": 300,
                "yAxis": {
                    "left": {
                        "label": "Percentage",
                        "min": 0,
                        "max": 100
                    }
                },
                "annotations": {
                    "horizontal": [
                        {
                            "value": 1,
                            "label": "Target < 1%",
                            "color": "#2ca02c"
                        },
                        {
                            "value": 5,
                            "label": "Critical > 5%",
                            "color": "#d62728"
                        }
                    ]
                }
            }
        }

    def _error_breakdown_widget(self, x: int, y: int) -> Dict:
        """Error breakdown by type"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 8,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Errors", "RetrievalErrors", {
                        "stat": "Sum",
                        "label": "Retrieval"
                    }],
                    [".", "GenerationErrors", {
                        "stat": "Sum",
                        "label": "Generation"
                    }],
                    [".", "ThrottlingErrors", {
                        "stat": "Sum",
                        "label": "Throttling"
                    }],
                    [".", "ValidationErrors", {
                        "stat": "Sum",
                        "label": "Validation"
                    }]
                ],
                "view": "timeSeries",
                "stacked": True,
                "region": "us-east-1",
                "title": "🔍 Error Breakdown",
                "period": 300
            }
        }

    def _recent_errors_log_widget(self, x: int, y: int) -> Dict:
        """Recent errors from logs"""
        return {
            "type": "log",
            "x": x,
            "y": y,
            "width": 8,
            "height": 6,
            "properties": {
                "query": """
                SOURCE '/aws/lambda/rag-api'
                | fields @timestamp, @message, error_type, request_id
                | filter @message like /ERROR/
                | sort @timestamp desc
                | limit 20
                """,
                "region": "us-east-1",
                "title": "📋 Recent Errors",
                "view": "table"
            }
        }

    def _model_comparison_widget(self, x: int, y: int) -> Dict:
        """Compare model performance"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Models", "AvgLatency", {
                        "stat": "Average",
                        "dimensions": {"ModelId": "claude-sonnet-4"}
                    }],
                    ["...", {
                        "dimensions": {"ModelId": "claude-opus-4"}
                    }],
                    ["...", {
                        "dimensions": {"ModelId": "claude-haiku-4"}
                    }]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": "us-east-1",
                "title": "🤖 Model Latency Comparison",
                "period": 300
            }
        }

    def _stop_reasons_widget(self, x: int, y: int) -> Dict:
        """LLM stop reasons distribution"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Behavior", "StopReason", {
                        "stat": "SampleCount",
                        "dimensions": {"Reason": "end_turn"}
                    }],
                    ["...", {
                        "dimensions": {"Reason": "max_tokens"}
                    }],
                    ["...", {
                        "dimensions": {"Reason": "stop_sequence"}
                    }]
                ],
                "view": "timeSeries",
                "stacked": True,
                "region": "us-east-1",
                "title": "🛑 Stop Reasons",
                "period": 300
            }
        }

    def _user_satisfaction_widget(self, x: int, y: int) -> Dict:
        """User feedback scores"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/UserExperience", "FeedbackScore", {
                        "stat": "Average",
                        "label": "Avg Satisfaction"
                    }]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": "us-east-1",
                "title": "⭐ User Satisfaction (1-5)",
                "period": 300,
                "yAxis": {
                    "left": {
                        "min": 1,
                        "max": 5
                    }
                }
            }
        }

    def _session_metrics_widget(self, x: int, y: int) -> Dict:
        """Session-level metrics"""
        return {
            "type": "metric",
            "x": x,
            "y": y,
            "width": 12,
            "height": 6,
            "properties": {
                "metrics": [
                    ["GenAI/Sessions", "AvgSessionDuration", {
                        "stat": "Average",
                        "label": "Avg Duration (s)"
                    }],
                    [".", "AvgTurnsPerSession", {
                        "stat": "Average",
                        "label": "Avg Turns"
                    }]
                ],
                "view": "timeSeries",
                "stacked": False,
                "region": "us-east-1",
                "title": "💬 Session Metrics",
                "period": 300
            }
        }

    def _xray_service_map_widget(self, x: int, y: int) -> Dict:
        """X-Ray service map"""
        return {
            "type": "trace",
            "x": x,
            "y": y,
            "width": 24,
            "height": 8,
            "properties": {
                "title": "🗺️ X-Ray Service Map - RAG System",
                "region": "us-east-1"
            }
        }

# Create dashboard
builder = GenAIDashboardBuilder()
dashboard_url = builder.create_production_dashboard()

Alarming Strategy for GenAI Systems

Set up intelligent alarms that catch real issues:

# genai_alarms.py
import boto3
from typing import Dict, List

class GenAIAlarmManager:
    """Comprehensive alarming for GenAI systems"""

    def __init__(self, sns_topic_arn: str):
        self.cloudwatch = boto3.client('cloudwatch')
        self.sns_topic_arn = sns_topic_arn

    def create_all_alarms(self):
        """Create complete alarm suite"""

        alarms = [
            # Quality alarms
            self._quality_degradation_alarm(),
            self._faithfulness_critical_alarm(),

            # Performance alarms
            self._high_latency_alarm(),
            self._latency_spike_alarm(),

            # Cost alarms
            self._cost_spike_alarm(),
            self._daily_budget_alarm(),

            # Error alarms
            self._high_error_rate_alarm(),
            self._retrieval_failure_alarm(),

            # Composite alarms
            self._system_degraded_composite_alarm()
        ]

        for alarm_config in alarms:
            # Composite alarms (AlarmRule) use a different API call
            # than standard metric alarms.
            if 'AlarmRule' in alarm_config:
                self.cloudwatch.put_composite_alarm(**alarm_config)
            else:
                self.cloudwatch.put_metric_alarm(**alarm_config)
            print(f"✓ Created alarm: {alarm_config['AlarmName']}")

    def _quality_degradation_alarm(self) -> Dict:
        """Alert when quality metrics drop"""
        return {
            'AlarmName': 'RAG-Quality-Degradation',
            'ComparisonOperator': 'LessThanThreshold',
            'EvaluationPeriods': 2,
            'DatapointsToAlarm': 2,  # 2 out of 2
            'MetricName': 'Faithfulness',
            'Namespace': 'GenAI/Quality',
            'Period': 300,
            'Statistic': 'Average',
            'Threshold': 0.75,
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': 'Faithfulness score dropped below 0.75 for 10 minutes',
            'TreatMissingData': 'notBreaching'
        }

    def _faithfulness_critical_alarm(self) -> Dict:
        """Critical alarm for severe quality drop"""
        return {
            'AlarmName': 'RAG-Faithfulness-Critical',
            'ComparisonOperator': 'LessThanThreshold',
            'EvaluationPeriods': 1,
            'DatapointsToAlarm': 1,
            'MetricName': 'Faithfulness',
            'Namespace': 'GenAI/Quality',
            'Period': 300,
            'Statistic': 'Average',
            'Threshold': 0.60,
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': 'CRITICAL: Faithfulness below 0.60 - immediate action required',
            'TreatMissingData': 'breaching'
        }

    def _high_latency_alarm(self) -> Dict:
        """Alert on high P95 latency"""
        return {
            'AlarmName': 'RAG-High-Latency-P95',
            'ComparisonOperator': 'GreaterThanThreshold',
            'EvaluationPeriods': 3,
            'DatapointsToAlarm': 2,
            'MetricName': 'EndToEndLatency',
            'Namespace': 'GenAI/Performance',
            'Period': 300,
            'ExtendedStatistic': 'p95',
            'Threshold': 5000,  # 5 seconds
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': 'P95 latency exceeded 5 seconds',
            'TreatMissingData': 'notBreaching'
        }

    def _latency_spike_alarm(self) -> Dict:
        """Detect sudden latency spikes using anomaly detection"""
        return {
            'AlarmName': 'RAG-Latency-Anomaly',
            'ComparisonOperator': 'GreaterThanUpperThreshold',
            'EvaluationPeriods': 2,
            'Metrics': [
                {
                    'Id': 'm1',
                    'ReturnData': True,
                    'MetricStat': {
                        'Metric': {
                            'Namespace': 'GenAI/Performance',
                            'MetricName': 'EndToEndLatency'
                        },
                        'Period': 300,
                        'Stat': 'Average'
                    }
                },
                {
                    'Id': 'ad1',
                    'Expression': 'ANOMALY_DETECTION_BAND(m1, 2)',
                    'Label': 'Latency (expected)'
                }
            ],
            'ThresholdMetricId': 'ad1',
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': 'Latency anomaly detected (2 standard deviations)',
            'TreatMissingData': 'notBreaching'
        }

    def _cost_spike_alarm(self) -> Dict:
        """Alert on unexpected cost spikes"""
        return {
            'AlarmName': 'RAG-Cost-Spike',
            'ComparisonOperator': 'GreaterThanThreshold',
            'EvaluationPeriods': 1,
            'DatapointsToAlarm': 1,
            'MetricName': 'TotalCost',
            'Namespace': 'GenAI/Cost',
            'Period': 300,
            'Statistic': 'Sum',
            'Threshold': 50.0,  # $50 per 5 minutes
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': 'Cost spike detected: >$50 in 5 minutes',
            'TreatMissingData': 'notBreaching'
        }

    def _daily_budget_alarm(self) -> Dict:
        """Alert when approaching daily budget"""
        return {
            'AlarmName': 'RAG-Daily-Budget-Warning',
            'ComparisonOperator': 'GreaterThanThreshold',
            'EvaluationPeriods': 1,
            'DatapointsToAlarm': 1,
            'MetricName': 'TotalCost',
            'Namespace': 'GenAI/Cost',
            'Period': 86400,  # 24 hours
            'Statistic': 'Sum',
            'Threshold': 800.0,  # $800 per day (80% of $1000 budget)
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': 'Daily cost approaching budget limit (80%)',
            'TreatMissingData': 'notBreaching'
        }

    def _high_error_rate_alarm(self) -> Dict:
        """Alert on elevated error rate"""
        return {
            'AlarmName': 'RAG-High-Error-Rate',
            'ComparisonOperator': 'GreaterThanThreshold',
            'EvaluationPeriods': 2,
            'DatapointsToAlarm': 2,
            'MetricName': 'ErrorRate',
            'Namespace': 'GenAI/Errors',
            'Period': 300,
            'Statistic': 'Average',
            'Threshold': 5.0,  # 5% error rate
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': 'Error rate exceeded 5%',
            'TreatMissingData': 'notBreaching'
        }

    def _retrieval_failure_alarm(self) -> Dict:
        """Alert on retrieval failures"""
        return {
            'AlarmName': 'RAG-Retrieval-Failures',
            'ComparisonOperator': 'GreaterThanThreshold',
            'EvaluationPeriods': 1,
            'DatapointsToAlarm': 1,
            'MetricName': 'RetrievalErrors',
            'Namespace': 'GenAI/Errors',
            'Period': 300,
            'Statistic': 'Sum',
            'Threshold': 10,  # 10 retrieval failures in 5 min
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': 'Multiple retrieval failures detected',
            'TreatMissingData': 'notBreaching'
        }

    def _system_degraded_composite_alarm(self) -> Dict:
        """Composite alarm for multiple degradation signals"""
        return {
            'AlarmName': 'RAG-System-Degraded',
            'AlarmRule': (
                '(ALARM("RAG-Quality-Degradation") OR ALARM("RAG-High-Latency-P95")) '
                'AND ALARM("RAG-High-Error-Rate")'
            ),
            'ActionsEnabled': True,
            'AlarmActions': [self.sns_topic_arn],
            'AlarmDescription': (
                'System degraded: Multiple quality/performance/error issues detected'
            )
        }

# Usage
alarm_manager = GenAIAlarmManager(
    sns_topic_arn='arn:aws:sns:us-east-1:123456789012:genai-alerts'
)

alarm_manager.create_all_alarms()

Integration with Existing Observability Tools

Grafana Integration

Grafana ships with a built-in CloudWatch data source, so the custom GenAI/* namespaces published above can be charted alongside the rest of your stack: point the data source at your account (or an assumable IAM role) and query the same metrics the CloudWatch dashboard uses.

Datadog Integration

Datadog’s AWS integration collects CloudWatch metrics automatically; once the GenAI/* namespaces are included in the integration’s metric collection settings, the quality, cost, and latency metrics appear in Datadog without extra code. If you need more control, you can also pull the metrics yourself and forward them.
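
A minimal sketch of that pull, using CloudWatch’s get_metric_data (the namespace and metric name match the examples above; forwarding is left as a stub):

# export_metrics.py
# Sketch: read a custom GenAI metric from CloudWatch so it can be
# forwarded to an external observability backend.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            'Id': 'faithfulness',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'GenAI/Quality',
                    'MetricName': 'Faithfulness'
                },
                'Period': 300,
                'Stat': 'Average'
            }
        }
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc)
)

for result in response['MetricDataResults']:
    for ts, value in zip(result['Timestamps'], result['Values']):
        print(f"{ts.isoformat()} faithfulness={value:.3f}")  # replace with forwarding logic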

Key Takeaways

  1. CloudWatch GenAI Observability is purpose-built – Provides out-of-the-box dashboards for Bedrock model invocations and AgentCore agents. No custom instrumentation needed for basic metrics.

  2. OpenTelemetry + ADOT enables custom observability – Use ADOT to instrument your application with custom spans capturing retrieval quality, token usage, and costs. Automatically traces boto3 AWS SDK calls.

  3. X-Ray provides the service map – Distributed tracing shows bottlenecks across your RAG pipeline. Service maps visualize dependencies and highlight slow components (typically vector search).

  4. Comprehensive dashboards require custom metrics – Quality scores (faithfulness, relevancy), cost per query, and token breakdowns need custom CloudWatch metrics alongside out-of-the-box Bedrock metrics.

  5. Intelligent alarming prevents incidents – Set thresholds for quality degradation, cost spikes, and latency. Use composite alarms for multi-signal degradation detection. Anomaly detection catches unusual patterns.

  6. Integration extends visibility – Export to Grafana, Datadog, or existing observability stacks using CloudWatch exporters or direct API integration. Don’t build in isolation.

  7. Traces + Metrics + Logs = Complete picture – You need all three: traces for request flows, metrics for aggregates, logs for debugging specific failures. CloudWatch GenAI Observability provides this unified view.

What’s Next in This Series

Part 4: Production Hardening & Advanced Patterns (Coming Next Week – Series Finale!)

We’ll close the series with production-ready patterns:

  • Guardrails in production: Content filtering, PII detection, toxicity screening
  • Human-in-the-loop evaluation: Building feedback loops and annotation workflows
  • Incident response playbooks: What to do when GenAI fails at 3 AM
  • A/B testing strategies: Testing prompts, models, and RAG configurations
  • Canary deployments: Safe rollout strategies with automated rollback
  • Advanced cost optimization: Model routing, caching, and batch processing
  • Security hardening: Protecting against prompt injection and jailbreaks

Additional Resources

AWS Documentation:

  • CloudWatch GenAI Observability
  • AWS X-Ray Developer Guide
  • AWS Distro for OpenTelemetry
  • OpenTelemetry Python SDK

Sample Code & Workshops:

  • CloudWatch GenAI Observability Samples
  • AWS Observability Workshop
  • X-Ray SDK for Python

OpenTelemetry Resources:

  • OpenTelemetry Specification
  • ADOT Collector Documentation
  • Semantic Conventions for GenAI

Integration Guides:

  • Prometheus CloudWatch Exporter
  • Datadog AWS Integration
  • Grafana CloudWatch Data Source

Let’s Connect!

Building observability for production GenAI systems? Let’s share experiences!

Follow me for Part 4 (the series finale!) on Production Hardening & Advanced Patterns. We’ll cover guardrails, incident response, A/B testing, and cost optimization—everything you need to run GenAI at scale.

About the Author

Shoaibali Mir

I’m an engineer with 4+ years of experience spanning the DevOps, data, cloud, and AI/ML engineering domains. Alongside full-time work, I’m pursuing a Master’s degree in AI/ML at BITS Pilani.

Connect with me on:

  • LinkedIn
  • Twitter/X

Tags: #aws #genai #observability #cloudwatch #xray #opentelemetry #monitoring #genaops #bedrock #distributedtracing

NgSysV2-10.2: PowerShell Scripting essentials

This post series is indexed at NgateSystems.com. You’ll find a super-useful keyword search facility there too.

Last reviewed: Apr’26

Introduction

The concept of “scripting” for PowerShell terminal session procedures was introduced in Post 4.3 without any attempt to describe the language’s syntax.

This post still doesn’t fully cover the subject, but here are the essentials, plus one or two “extras” I’ve found invaluable.

Variables, Arrays and Operators

A variable in PowerShell is a named container that holds a value, such as a string, number, array or object. PowerShell variables are dynamically typed, meaning you don’t need to declare a variable’s data type when assigning a value; it is determined by the assigned value, and you can later assign a value of a different type. If you want a fixed type, constrain the variable with a cast such as [int]$Count = 5; subsequent assignments must then be convertible to that type.

Variable names in PowerShell are introduced with a $ symbol. Variable names are not case sensitive, so, for instance, $MyVariable and $myvariable refer to the same variable.

Operators

Arithmetical — For mathematical calculations (+, -, *, /, %)
Comparative — For comparing values (-eq, -ne, -gt, -lt, -ge, -le)
Logical — For combining conditions (-and, -or, -not)
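
For example, a short sketch pulling variables, arrays and operators together:

$Fruits = @("apple", "banana", "cherry") # an array literal
$Count = $Fruits.Length                  # arrays expose Length/Count

if (($Count -ge 3) -and ($Fruits[0] -eq "apple")) {
    Write-Output "We have $Count fruits, starting with $($Fruits[0])"
}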

Comments

Introduce these with a # symbol. Multi-line comments are wrapped in <# ... #>.

Output

You can write to the console like this:

Write-Output "$MyMessage" # write this to the terminal

You can write to a file as follows:

$LogPath = "c:/path-to-my-log-file"
$MyMessage  = " .... "
Add-Content -Path $LogPath -Value $Message

Loops

For loop — Used when you know the exact number of iterations required. It’s commonly used for tasks such as incrementing counters or processing arrays.
While loop — Continues executing as long as a specified condition evaluates to True. It’s ideal for scenarios where the number of iterations depends on a dynamic condition.
ForEach loop — Designed for iterating through collections like arrays or output from commands; see the foreach sketch after the next example.

For example, a while loop:

$i = 0 # Initialise a counter
while ($i -lt 5) {
    Write-Output "Iteration: $i"
    $i++
}
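
And the foreach equivalent, iterating over a collection:

$Names = @("Alice", "Bob", "Carol")
foreach ($Name in $Names) {
    Write-Output "Hello $Name"
}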

Conditionals

PowerShell’s if-else syntax follows this model:

$num = 10
if ($num -gt 5) {
    Write-Output "$num is greater than 5"
}
else {
    Write-Output "$num is less than or equal to 5"
}

“Null”

In PowerShell,

$MyVariable = $null

means “This variable exists, but it contains no value”.

It is particularly useful because it evaluates to “false” in a boolean context, along with:

  • $false → explicit boolean false
  • 0 → numeric zero
  • "" → empty string
  • @() → empty array

Which is very handy because you can now do:

$MyVariable = $null

if ($MyVariable) {
    Write-Output "True" # or - a handy shortcut - just say "True"
}
else {
    Write-Output "False"
}

Functions

PowerShell lets you define and reference shared blocks of code with the following model:

function Display-Greeting {
    param(
        [string]$Name,
        [int]$Count
    )
    Write-Output "Name: $Name"
    Write-Output "Count: $Count"

    for ($i = 1; $i -le $Count; $i++) {
        Write-Output "Hello $Name ($i)"
    }
}
...
Display-Greeting -Name "Martin" -Count 3

Options also exist to restrict a function to valid parameter values, using validation attributes such as [ValidateSet()] and [ValidateRange()] – see the sketch below.
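
Both attributes are built into PowerShell; the function below is purely illustrative:

function Set-Volume {
    param(
        [ValidateRange(0, 100)]
        [int]$Level,

        [ValidateSet("Low", "Medium", "High")]
        [string]$Preset
    )
    Write-Output "Level: $Level, Preset: $Preset"
}

Set-Volume -Level 50 -Preset "Medium" # fine
Set-Volume -Level 200 -Preset "Loud"  # fails parameter validation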

Referencing a script with parameters

A parent script can call a subordinate “child” script as follows:

& ".child.ps1" -Param1 value1 -Param2 value2

Meanwhile, the child script might be configured as follows:

param(
    [Parameter(Mandatory = $true)]
    [string]$Param1,

    [Parameter(Mandatory = $true)]
    [string]$Param2
)

The “Try … Catch” block

Just as in JavaScript, you can capture exceptions and direct them to a “problem-resolution” block:

try {
    # Code that might throw an error
} catch {
    Write-Output "An error occurred: $_"
} finally {
    Write-Output "This code always runs, regardless of errors"
}

More usefully, you might consider opening an account with Pushover and configuring your script to send a notification to your mobile phone. Registration and configuration are extremely easy and, last time I checked, a lifetime Pushover account costs just $5.

} catch {    
    curl.exe -s -o NUL --form-string "priority=1" `
        --form-string "token=aiu7yk ..obfuscated... 5uerqq6ix" `
        --form-string "user=ueczz ..obfuscated... jrv54u22" `
        --form-string "message=Something has gone wrong with your nightly .. run" `
        https://api.pushover.net/1/messages.json
}

The -s -o NUL parameters above simply suppress the display of the response from curl.exe, which is no help at all when the script runs under the Windows scheduler.

But when adding a try block, you need to know that PowerShell does not throw exceptions for many errors by default (especially for external commands). You can fix this either by adding explicit error-handling instructions to individual commands:

Some-Command -ErrorAction Stop

or, more realistically, by setting global instructions with:

$ErrorActionPreference = "Stop"

The “pipeline”

An advanced PS1 script will make extensive use of an arrangement that lets you “chain” commands together with a “pipe” symbol – | – that passes the output from one command as an object that then provides input to the next. So, in

Get-Process | Sort-Object CPU
  • Get-Process → produces “process objects”
  • Sort-Object → receives those objects and sorts them by a property

No string parsing, no fragile text handling — just structured data flowing through. PowerShell passes objects, not text, and binds them by property name or type. Here’s another example:

Get-Service | Where-Object {$_.Status -eq 'Running'}
  • Get-Service → outputs service objects
  • Where-Object → filters them, selecting those with status “running”

Most pipelines follow this shape:

Producer | Filter | Transform | Output

For example:

Get-Process |
Where-Object {$_.CPU -gt 100} |
Select-Object Name, CPU |
Out-File processes.txt

You’ll use these constantly:

  • Where-Object → filter
  • Select-Object → pick properties
  • Sort-Object → sort
  • ForEach-Object → act on each item

What’s New in RustRover 2026.1

Welcome to RustRover 2026.1. This version focuses on supporting the way modern Rust teams build, test, and maintain their code. Highlights include:

  • Native cargo-nextest integration
  • Call hierarchy for faster navigation
  • Easier access to macro expansions
  • Configurable visibility on module creation
  • Support for more AI agents, including GitHub Copilot and Cursor
Download RustRover


Key updates

Code analysis is now more accurate

We’ve continued improving RustRover’s code analysis, with a recent focus on reducing false positives that can cause confusion.

If you notice any false positives, please report them in our issue tracker so we can keep improving code insight.

Run tests faster with cargo-nextest support in the IDE

Running tests in large Rust workspaces can be slow with the default test runner. Many teams rely on cargo-nextest for faster, more scalable execution, but until now, it required switching to the terminal. We’ve added native support for cargo-nextest directly in the IDE. You can now run and monitor nextest sessions with full progress reporting and structured results in the Test tool window, without leaving your development workflow.

Trace call chains more easily

If you’ve ever tried to trace how execution reaches a function in a trait-heavy codebase, a flat list of usages can be hard to interpret. You get the matches, but you lose the bigger picture of the call chain.

RustRover 2026.1 adds Call Hierarchy support for Rust, so you can explore call relationships in a dedicated view and navigate complicated code faster. The hierarchy is Rust-aware and distinguishes between trait method calls and calls to concrete implementations.

ACP Registry in RustRover

In addition to Junie, Claude Agent, and most recently Codex, RustRover now lets you work with more AI agents directly in the AI chat. You can choose from agents such as GitHub Copilot, Cursor, and many others supported through the Agent Client Protocol (ACP).

Choose module visibility on creation 

When you create a new module, you often know right away whether it should be public or private. Previously, that meant creating the file first and then updating visibility manually.

RustRover now lets you choose module visibility directly in the New Rust Module dialog. This means you can create public or private modules and attach them to a module in a single step, reducing cleanup and keeping project structure consistent.

Workflow improvements

Updated LLDB debugger

RustRover 2026.1 updates LLDB to version 21, bringing performance and reliability improvements for debugging sessions. Expect faster loading of debug information through improved DWARF indexing and parallel shared-library parsing, along with more reliable breakpoint behavior in inline code.   

Macro expansion, one step away

Rust macros can hide a lot of logic behind a single line. When you need to confirm what code will actually be compiled, seeing the expansion is often the fastest way to understand what is going on.

RustRover makes it easier to find macro expansions right where you need them. Use the gutter icon on macro calls or the ⌥↩ (macOS) / Alt+Enter (Windows/Linux) shortcut to open the Show Context Actions menu and inspect the generated code without leaving the editor. 

Bug fixes and code insight improvements

Code insight improvements for derive macros

Derive and procedural macros generate code behind the scenes, which can make IDE analysis harder than it looks in the source. 

RustRover 2026.1 improves name resolution to reduce misleading warnings and keep editor feedback more dependable. Expect cleaner inspections and steadier code insight in macro-heavy projects.

Restored trust in IDE diagnostics when working with rustc crates

If you work with nightly and compiler-internal crates (rustc_*), you may have seen RustRover report E0463 errors even though the project still built successfully. That mismatch can make it harder to rely on editor feedback when you are working close to compiler internals. RustRover 2026.1 reduces these false positives, so diagnostics in the editor better match what you get from cargo build and cargo check when using rustc_* crates.

AI updates

Next edit suggestions, now quota-free 

Next edit suggestions help you apply related edits across a file, not just at the cursor. In RustRover 2026.1, they are available without consuming AI quota for JetBrains AI Pro, Ultimate, and Enterprise subscriptions, helping you keep changes consistent and stay in the flow while you iterate.

More agent options in the AI chat

RustRover now supports a wider choice of agents in the AI chat, including Junie and Codex, so you can pick the one that best fits the task at hand. This lets you switch between assistance styles without leaving your development workflow.

AI help for database work

When you’re working with a connected database, RustRover’s AI chat can help you query and analyze data, adjust SQL queries, and confirm changes right in the IDE. This keeps database work in the same flow as your code, instead of bouncing between tools. External agents can access the same database support through an MCP server.

Code With Me sunset

As we continue to evolve our IDEs and focus on the areas that deliver the most value to developers, we’ve decided to sunset Code With Me, our collaborative coding and pair programming service. Demand for this type of functionality has declined in recent years, and we’re prioritizing more modern workflows tailored to professional software development.

As of version 2026.1, Code With Me will be unbundled from all JetBrains IDEs. Instead, it will be available on JetBrains Marketplace as a separate plugin. 2026.1 will be the last IDE version to officially support Code With Me, as we gradually sunset the service.

Download RustRover 2026.1

What’s New in PyCharm 2026.1

Welcome to PyCharm 2026.1. This release doesn’t just add features – it rethinks how you build, debug, and scale Python projects. From a brand-new debugging engine powered by debugpy to first-class uv support on remote targets and expanded JavaScript support in the free tier, this version is all about removing friction and letting you focus on your code. Whether you’re working locally, over SSH, or inside Docker, PyCharm now adapts to your setup instead of the other way around.

In this post, we’ll explore the highlights of this update and show you how these improvements can streamline your daily workflow.

Standardizing the future of debugging with debugpy

PyCharm now offers the option to use debugpy as the default debugger backend, providing the industry-standard Debug Adapter Protocol (DAP) that aligns the IDE with the broader Python ecosystem. By replacing complex, legacy socket-waiting logic with a more stable connection model, race conditions and timing edge cases will no longer interfere with your debugging experience.

A modern foundation for Python development

The new engine provides full native support for PEP 669, utilizing Python 3.12’s low-impact monitoring API to significantly reduce debugger overhead compared to the legacy sys.settrace() approach. This ensures that your debugging sessions are faster and less intrusive. Furthermore, the migration introduces comprehensive asyncio support. You can now use the full suite of debugger tools, such as the debug console and expression evaluation, directly within async contexts for modern frameworks like FastAPI and aiohttp. 

Reliability across environments

Beyond performance improvements, debugpy simplifies the Attach to Process experience by providing a standardized approach for Docker containers, remote servers on AWS, Azure, or GCP, and local running processes. For specialized workflows, we have introduced a new Attach to DAP run configuration. This allows you to connect to targets using the debugpy.listen() command, eliminating the friction of manual connection management and allowing you to focus on your code instead of debugging infrastructure.
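
For orientation, the listen-and-attach flow that the new Attach to DAP configuration targets looks roughly like this on the target side (the host and port are arbitrary choices):

import debugpy

# Open a DAP endpoint and block until a client (the IDE) attaches.
debugpy.listen(("localhost", 5678))
debugpy.wait_for_client()

print("Debugger attached; breakpoints are now active")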

Support for uv as a remote interpreter

Many developers work on projects where the code and dependencies live on a remote server – whether via SSH, in WSL, or inside Docker. By connecting PyCharm to a remote machine and using uv as the interpreter, you can keep the environment fully synchronized, ensure package management works as expected, and run projects smoothly – just as if everything were local.

Free professional web development for everyone

With PyCharm 2026.1, the core IDE experience continues to evolve as we bring a broader set of professional-grade web tools to all users for free. Everyone, from beginners to backend-first developers, now has access to a substantial set of JavaScript, TypeScript, and CSS features, as well as advanced navigation and code intelligence previously available only with a Pro subscription.

For a complete breakdown of all new features, check out this blog post. 

Advancements in AI integration

PyCharm is evolving into an open platform that gives you the freedom to bring the AI tools of your choice directly into your professional development workflow. This release focuses on providing a flexible ecosystem where you can orchestrate the best models and agents available today.

The ACP Registry: Your gateway to new agents

Keeping up with the rapid pace of AI development can be a challenge, with new coding agents appearing almost daily. To help you navigate this dynamic landscape, we’ve launched the ACP Registry – a built-in directory of AI coding agents integrated directly into your IDE via the Agent Client Protocol.

Whether you want to experiment with open-source agents like OpenCode or specialized tools like Gemini CLI, you can now discover and install them in just a few clicks. If you have a custom setup or an agent that isn’t listed yet, you can easily add it via the acp.json configuration, giving you the flexibility to use your favorite tools, with no strings attached.

Native OpenAI Codex integration and BYOK

OpenAI Codex is now natively integrated into the JetBrains AI chat. This means you can tackle complex development tasks without switching to a browser or copy-pasting code between windows.

We’ve also introduced Bring Your Own Key (BYOK) support. You can now connect your own API keys from OpenAI, Anthropic, or other compatible providers – including local models – directly in the IDE settings. This allows you to choose the setup that fits your workflow and budget best, while keeping all your AI-powered development inside PyCharm.

Stay in the flow with next edit suggestions

Small changes in your code often trigger a cascade of mechanical follow-up edits. Adding a parameter to a function or renaming a symbol can lead to errors popping up across your entire file.

Next edit suggestions (NES) offer a smarter, lightweight alternative to asking an AI agent for a full rewrite. As you modify your code, PyCharm proactively predicts the most likely next changes and suggests them inline.

  • Effortless consistency: Update all call sites across a file with a simple Tab Tab experience.
  • Stay in control: Move step by step through changes rather than reviewing large, automated diffs.
  • No quota required: Use NES without consuming the AI quota of your JetBrains AI Pro subscription.

This natural evolution of code completion keeps you in the flow, making those small cascading fixes feel almost effortless.

All of the updates mentioned above are just a glimpse of what’s new in PyCharm 2026.1.

There is even more under the hood, including performance improvements, stability upgrades, and thoughtful refinements across the IDE that make everyday development smoother and faster.

To explore the full list of updates, check out our What’s New page. 

As always, we would love to hear your feedback. Your insights help us shape the future of PyCharm – and we cannot wait to see what you build next.

DataSpell 2026.1: AI Agents Ecosystem, Export Notebooks to PDF, Editor Improvements

With the 2026.1 release, DataSpell continues to improve the way you explore data, work with notebooks, and integrate AI into your workflows. This update expands the AI ecosystem with new agents support and brings several productivity improvements throughout the IDE.

Read on to discover everything that’s new in DataSpell 2026.1.

AI Agents ecosystem

DataSpell 2026.1 expands the range of AI tools you can use directly inside the IDE.

In addition to Claude Agent and DataSpell’s AI agent, the AI chat now supports Codex, Cursor, and many other agents through the Agent Client Protocol (ACP). This allows you to integrate the AI tools of your choice into your everyday workflows.

To make discovering and installing agents easier, we’ve introduced the ACP Registry. It lets you browse available AI agents and install them with a single click directly from the IDE.

Ability to export notebooks to PDF

You can now export Jupyter notebooks to PDF directly from DataSpell.

This new native export implementation works without requiring Python, nbconvert, or LaTeX. Instead, notebooks are converted directly inside the IDE, making the export process faster, simpler, and more reliable.

In-terminal completion

The integrated terminal now provides command and parameter suggestions as you type.

This helps you quickly discover available options when working with tools such as:

  • Git
  • Docker
  • Custom CLI utilities

Terminal completion makes working with command-line tools inside the IDE faster and more convenient.

Editor caret animation

The editor now includes new caret animation modes designed to improve the typing experience.

The two available modes are:

Snappy
The caret quickly jumps to the new position and slows slightly as it settles, providing smooth movement while keeping the editor responsive.

Gliding
The caret moves smoothly across the screen, making large cursor jumps easier to follow visually.

To use these modes, open Settings | Editor | General | Appearance, enable Use smooth caret movement, and select your preferred animation mode.

We hope you enjoy exploring DataSpell 2026.1! If you encounter a bug or have a feature suggestion, please share it on our issue tracker.

Stay up to date on the latest DataSpell news and data analysis tips – subscribe to our blog and follow us on X.

Using Spring Data JPA with Kotlin

This post was written together with Thorben Janssen, who has more than 20 years of experience with JPA and Hibernate and is the author of “Hibernate Tips: More than 70 Solutions to Common Hibernate Problems” and the JPA newsletter.

Spring Data JPA is based on the Jakarta Persistence specification and was originally designed for Java. That often raises the question of whether it is a good fit for Kotlin projects. 

The short answer is yes! 

You can use Spring Data JPA with Kotlin without any issues and enjoy Kotlin’s compact syntax and language features, like null safety and extension functions, when writing your business code.

And doing all of that is so quick and easy that it can be explained in this short blog post. Let’s use Spring Data JPA with Kotlin to define and use a simple persistence layer.

Required dependencies

The easiest way to get started is to use the “New Project” wizard in IntelliJ. Once you select Kotlin and Spring Data JPA, the basic setup is done for you. That includes configuring the Kotlin no-arg and all-open plugins. They ensure that your Kotlin classes fulfill Jakarta Persistence’s requirements for non-final classes and parameterless constructors. You also get the kotlin-reflect dependency, which is required by Spring.

On the next page, you can select the Spring Boot Starter modules and other dependencies you want to use. In this example, that’s Spring Data JPA and the PostgreSQL database driver.

Adding Kotlin to an existing project

If you already have a Java-based Spring Boot project with the required dependencies, you can simply add a Kotlin class to it. Starting with version 2026.1, IntelliJ IDEA automatically adds the plugin.spring and plugin.jpa plugins to your build configuration and configures the all-open plugin.

In case you’re using an older IDEA version, you have to add the following configuration yourself.

plugins {
   kotlin("plugin.spring") version "2.2.20"
   kotlin("plugin.jpa") version "2.2.20"
}

allOpen {
   annotation("jakarta.persistence.Entity")
   annotation("jakarta.persistence.MappedSuperclass")
   annotation("jakarta.persistence.Embeddable")
}
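
The plugin.jpa plugin is essentially a preconfigured wrapper around Kotlin’s no-arg compiler plugin. If you want to see or tweak what it does under the hood, the explicit equivalent looks roughly like this – a sketch assuming you apply kotlin("plugin.noarg") directly instead of the JPA wrapper:

noArg {
   // Generate a parameterless constructor for classes carrying these annotations,
   // as required by the Jakarta Persistence specification
   annotation("jakarta.persistence.Entity")
   annotation("jakarta.persistence.Embeddable")
   annotation("jakarta.persistence.MappedSuperclass")
}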

Database and logging configuration

After defining your project’s dependencies, you need to set up the database connection in your application.properties file, and you can provide your preferred logging configuration. 

The following settings connect to a local PostgreSQL database and activate detailed Hibernate logging. The logging configuration instructs Hibernate to log the executed SQL statements and all bind parameter values. This information is extremely helpful during development and debugging, but it generates a lot of output. So, please make sure to use a different logging configuration in production.

spring.datasource.url=jdbc:postgresql://localhost:5432/postgres

spring.datasource.username=postgres
spring.datasource.password=postgres

logging.level.root=INFO
logging.level.org.hibernate.SQL=DEBUG
logging.level.org.hibernate.orm.jdbc.bind=TRACE

Modeling entities

You can then start modeling your entities in Kotlin. The Jakarta Persistence specification defines a few requirements for entity classes. As we explained in a recent article on common best practices, some of these requirements don’t align well with Kotlin’s data classes. But if you define your entity classes as regular Kotlin classes and annotate the fields you want to persist, you will not run into any issues and can still enjoy Kotlin’s concise syntax.

It’s a general best practice to avoid exposing your entity classes and their technical dependencies in your API. Most teams introduce a second non-entity representation of their data for that. Using Kotlin, you can easily model those classes as data classes. Doing that requires additional mapping code to convert your data between the different formats. You could, of course, do that in your business code. But it’s much more comfortable to add a set of converter functions to your entity class or to select the data class directly from the database. You will see an example of the second approach later in this article. Let’s concentrate on the entity class for now.

Here is a simple Person entity. It maps the person’s first and last name, along with a many-to-one relationship to the company they work for. 

The PersonData class represents the same information. You can use it in your API without exposing any technical details of your persistence layer. To make using this class as comfortable as possible, the Person entity class provides 2 functions with the required mapping code.

@Entity
class Person(
    @Id
    @GeneratedValue
    var id: Long? = null,

    var firstName: String? = null,

    var lastName: String? = null,

    @ManyToOne(fetch = FetchType.LAZY)
    var company: Company? = null,

    @Version
    var version: Int? = null
) {
    fun createPersonData(): PersonData {
        return PersonData(
            id = id!!,
            firstName = firstName ?: "",
            lastName = lastName ?: "",
            version = version
        )
    }

    companion object {
        fun createPersonFromData(data: PersonData): Person {
            return Person(
                id = data.id,
                firstName = data.firstName,
                lastName = data.lastName,
                version = data.version
            )
        }
    }
}

data class PersonData(
    val id: Long,
    val firstName: String,
    val lastName: String,
    val version: Int?
)

The createPersonData function and the createPersonFromData companion function handle the mapping between the two representations, so the rest of your code can convert between them with a single call.
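
Here is a quick sketch of how that conversion could look in practice. The variable names are hypothetical, and note that createPersonData only works for a persisted entity because it relies on a non-null id:

// Hypothetical usage of the converter functions.
// createPersonData requires a persisted entity (non-null id).
val data = person.createPersonData()            // entity -> data class for your API
val entity = Person.createPersonFromData(data)  // data class -> entity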

The Company entity follows the same approach:

@Entity
class Company(
    @Id
    @GeneratedValue
    var id: Long? = null,

    var name: String = "default",

    @Version
    var version: Int? = null
)

After you have modeled your entity classes, you can start defining your repositories.

Designing and using a repository

Spring Data JPA’s repository abstraction works the same way in Kotlin as it does in Java. You extend one of the provided repository interfaces, such as JpaRepository or CrudRepository, and Spring Data provides you with an implementation. 

These repositories define a set of standard methods for fetching entities by primary key, persisting new entities, and removing existing ones. They also integrate with Spring’s transaction handling, so that you can use the @Transactional annotation on your business or API layer to define the transaction handling.

Here is an example of a simple PersonRepository definition. It inherits all standard methods defined by the JpaRepository. We’ll look at adding your own query methods in the following examples.

interface PersonRepository : JpaRepository<Person, Long>

To make it even easier, IntelliJ IDEA can create the repository for you automatically. Just start typing the repository name in your service, and IDEA will suggest creating it:

With this repository in place, you can focus on your business logic. That’s especially convenient in Kotlin because constructor injection and concise function definitions keep your classes short and focused. 

@Component
@Transactional
class PersonController(
    private val personRepository: PersonRepository
) {
    fun createNewPerson(person: Person): Person {
        // add additional validations and/or logic ...
        return personRepository.save(person)
    }
}

In this example, Spring injects a PersonRepository instance and joins an active transaction or starts a new one before entering the createNewPerson method. If it started a new transaction, it also commits it after completing this method call. And the PersonRepository, together with the Jakarta Persistence implementation, provides the required code to create and execute a SQL INSERT statement that stores the provided Person object in the database.

2025-11-16T16:02:53.988+01:00 DEBUG 10104 --- [SDJWithKotlin] [           main] org.hibernate.SQL                        : select next value for person_seq
2025-11-16T16:02:54.012+01:00 DEBUG 10104 --- [SDJWithKotlin] [           main] org.hibernate.SQL                        : insert into person (company_id,first_name,last_name,version,id) values (?,?,?,?,?)
2025-11-16T16:02:54.014+01:00 TRACE 10104 --- [SDJWithKotlin] [           main] org.hibernate.orm.jdbc.bind              : binding parameter (1:BIGINT) <- [null]
2025-11-16T16:02:54.014+01:00 TRACE 10104 --- [SDJWithKotlin] [           main] org.hibernate.orm.jdbc.bind              : binding parameter (2:VARCHAR) <- [John]
2025-11-16T16:02:54.014+01:00 TRACE 10104 --- [SDJWithKotlin] [           main] org.hibernate.orm.jdbc.bind              : binding parameter (3:VARCHAR) <- [Doe]
2025-11-16T16:02:54.014+01:00 TRACE 10104 --- [SDJWithKotlin] [           main] org.hibernate.orm.jdbc.bind              : binding parameter (4:INTEGER) <- [0]
2025-11-16T16:02:54.015+01:00 TRACE 10104 --- [SDJWithKotlin] [           main] org.hibernate.orm.jdbc.bind              : binding parameter (5:BIGINT) <- [2]

You might know all of this from using Spring Data JPA with Java. Kotlin does not change any of this behavior, and you also get all the benefits from using Kotlin when implementing your business logic.

Let’s take a look at another example.

Fetching and updating an existing entity follows the same pattern. The updateLastName function loads the entity by its primary key and changes the lastName. That’s all you have to do. The Jakarta Persistence implementation finds that modification during its next dirty check and updates the database automatically.

@Component
@Transactional
class PersonController(
    private val personRepository: PersonRepository
) {
    fun updateLastName(id: Long, lastName: String): Person {
        val person = personRepository.findById(id).orElseThrow()
        person.lastName = lastName
        return person
    }
}

As you can see, Kotlin’s concise syntax helps keep the business logic easy to read, and Spring handles all the boilerplate code for you. That makes implementing your application very comfortable.

Adding your own queries

In addition to the standard methods provided by Spring Data JPA’s repositories, you need to define queries that fetch the data used in your business code. You can do that in 2 ways, both of which work fine with Kotlin. 

The first and most convenient option is to use derived query methods. Spring analyzes the method name, derives the corresponding JPQL query, and binds the method parameter values. This is a good choice when your query is simple and only requires one or two bind parameters.

You can add a derived query method directly to your repository. Or, in IntelliJ IDEA, you can start typing the desired method name and use autocompletion to have it added to your repository automatically.

[Video Snippet about repository method completion inside a repository]

You can see a typical example in the following code snippet. The findByLastName method fetches all Person entities with a lastName equal to the provided one.

interface PersonRepository : JpaRepository<Person, Long> {
   fun findByLastName(lastName: String): List<Person>
}
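
Calling the derived method from your business code is no different from calling any of the inherited standard methods. A hypothetical usage:

// Fetch all persons with the last name "Doe"
val does: List<Person> = personRepository.findByLastName("Doe")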

If your query becomes more complex, you should instead annotate your repository method with a @Query annotation. That allows you to write your own JPQL query and gives you full control over the executed statement. You can use joins, grouping, or any other JPQL feature you need.

Here you can see the same query as in the previous example, but this time defined with a @Query annotation instead of Spring Data’s derived query feature.

interface PersonRepository : JpaRepository<Person, Long> {
   @Query("select p from Person p where p.lastName = :lastName")
   fun getByLastName(lastName: String): List<Person>
}

When you call one of these methods, Spring Data JPA uses Jakarta Persistence’s EntityManager to instantiate a Query, set the provided bind parameters, execute the query, and map the result to a managed Person entity object.
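
To make that more concrete, here is a rough sketch of the manual code Spring Data JPA saves you from writing, assuming an injected EntityManager called em:

// A sketch of what Spring Data JPA does behind the scenes,
// assuming an injected EntityManager called em
val persons: List<Person> = em
    .createQuery("select p from Person p where p.lastName = :lastName", Person::class.java)
    .setParameter("lastName", "Doe")
    .resultList

Either way, Hibernate executes the following SQL statement: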

2025-11-16T16:47:20.949+01:00 DEBUG 16193 --- [SDJWithKotlin] [           main] org.hibernate.SQL                        : select p1_0.id,p1_0.company_id,p1_0.first_name,p1_0.last_name,p1_0.version from person p1_0 where p1_0.last_name=?
2025-11-16T16:47:20.952+01:00 TRACE 16193 --- [SDJWithKotlin] [           main] org.hibernate.orm.jdbc.bind              : binding parameter (1:VARCHAR) <- [Doe]

But entities are not the only projection you can use. For many use cases, a read-only DTO projection that only fetches the required information is more efficient. And Kotlin’s data classes are a great way to model such a DTO.

If you only want to show the first and last names of multiple people along with the company they work for, you could use the following PersonWithCompany data class.

data class PersonWithCompany(
    val firstName: String,
    val lastName: String,
    val company: String
)

In the next step, you can define a repository method that returns a List of those objects. If you annotate that method with a @Query annotation and provide a JPQL query that returns 3 fields with matching names, Spring Data JPA automatically maps each record to a PersonWithCompany object.

interface PersonRepository : JpaRepository<Person, Long> {
   @Query("select p.firstName, p.lastName, c.name as company from Person p join p.company c")
   fun findPersonsWithCompany(): List<PersonWithCompany>
}

As you can see in the log output, using a data class as your query projection combines the convenience of Kotlin data classes in your business code with the performance benefits of fetching only the required information from the database.

2025-11-16T16:59:42.260+01:00 DEBUG 22541 --- [SDJWithKotlin] [           main] org.hibernate.SQL                        : select p1_0.first_name,p1_0.last_name,c1_0.name from person p1_0 join company c1_0 on c1_0.id=p1_0.company_id
2025-11-16T16:59:42.278+01:00  INFO 22541 --- [SDJWithKotlin] [           main] c.t.j.k.s.SDJWithKotlinApplicationTests  : PersonWithCompany(firstName=Jane, lastName=Doe, company=Mighty Business Corp)

Providing your own repository method implementations

If you need more flexibility than Spring Data JPA’s @Query annotation provides, you can also add your own method implementations to a repository. You do that by creating an interface that defines only the methods you want to implement, letting your repository extend that interface, and providing an implementation of that interface. This is called a fragment repository.

In this example, the PersonFragmentRepository defines the searchPerson method that expects a PersonSearchInput parameter.

interface PersonFragmentRepository {
    fun searchPerson(searchBy: PersonSearchInput): List<Person>
}

data class PersonSearchInput(
    val firstName: String?,
    val lastName: String?,
    val worksForCompany: String?
)

In the next step, you have to implement the PersonFragmentRepository. The name of your class should be the interface name with the postfix Impl. Spring Data then automatically detects this class, wires it into your repository, and delegates all calls to the searchPerson method to your class.

The goal of the following searchPerson implementation is to check which fields of the PersonSearchInput object are set and consider only those fields in the query’s WHERE clause. This is a typical implementation for complex search dialogs, where users can choose which information to search for.

class PersonFragmentRepositoryImpl : PersonFragmentRepository {

    // Spring injects the EntityManager needed for the Criteria API
    @PersistenceContext
    private lateinit var em: EntityManager

    override fun searchPerson(searchBy: PersonSearchInput): List<Person> {
        val cBuilder = em.criteriaBuilder
        val cQuery = cBuilder.createQuery(Person::class.java)
        val person = cQuery.from(Person::class.java)
        val wherePredicates = mutableListOf<Predicate>()

        searchBy.firstName?.let {
            wherePredicates.add(cBuilder.equal(person.get<String>("firstName"), searchBy.firstName))
        }
        searchBy.lastName?.let {
            wherePredicates.add(cBuilder.equal(person.get<String>("lastName"), searchBy.lastName))
        }
        searchBy.worksForCompany?.let {
            val company = person.join<Person, Company>("company")
            wherePredicates.add(cBuilder.equal(company.get<String>("name"), searchBy.worksForCompany))
        }

        cQuery.where(*wherePredicates.toTypedArray())
        return em.createQuery(cQuery).resultList
    }
}


As you can see in the code snippet, the searchPerson method uses Jakarta Persistence’s Criteria API to define a query based on the fields set on the provided PersonSearchInput object.

It first gets a CriteriaBuilder and uses it to create a CriteriaQuery that returns Person objects. It then defines the FROM clause and creates a List of Predicates. For each field of the PersonSearchInput object that’s not null, an equal predicate gets added to the wherePredicates List.

Thanks to Kotlin’s concise syntax and null handling, defining those Predicates is straightforward. Only the handling of the company name requires a little attention. If that field is set, you have to add a join to the Company entity before you can define the equal predicate.

You can then use the wherePredicates List to define the WHERE clause, execute the query, and return the result.

After you define the PersonFragmentRepository and implement it, you can use it in your repository definition. Let’s add it to the PersonRepository, which you already know from previous examples. It now extends Spring Data JPA’s JpaRepository and the PersonFragmentRepository.

interface PersonRepository : JpaRepository<Person, Long>, PersonFragmentRepository {
    fun findByLastName(lastName: String): List<Person>

    @Query("select p from Person p where p.lastName = :lastName")
    fun getByLastName(lastName: String): List<Person>

    @Query("select p.firstName, p.lastName, c.name as company from Person p join p.company c")
    fun findPersonsWithCompany(): List<PersonWithCompany>
}

When you use this PersonRepository in your business code, Spring Data JPA provides the implementations of all methods defined by the JpaRepository. It also generates the implementations of the 3 query methods. Only the calls to the searchPerson method get delegated to your PersonFragmentRepositoryImpl class.
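
A hypothetical call from your business code could then look like this:

// Hypothetical usage: search only by last name and company,
// leaving the firstName criterion unset
val results = personRepository.searchPerson(
    PersonSearchInput(firstName = null, lastName = "Doe", worksForCompany = "Mighty Business Corp")
)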

Summary 

As you’ve seen in this article, Kotlin works well with Spring Data JPA. You can model your entities and define repositories in the same way you would in a Java application. Kotlin’s concise syntax often makes these parts of your code easier to read and maintain without changing any persistence behavior. If you follow the established Jakarta Persistence best practices for Kotlin, you get a smooth development experience and an efficient persistence layer.


ReSharper C++ 2026.1: Better performance, improved Unreal Engine workflows, and language support updates

ReSharper C++ 2026.1 is here, bringing performance improvements, expanded language support, and better tooling for Unreal Engine development.

This release focuses on improved performance in large C++ codebases, enhancements in coding assistance and code analysis, and continuing support for modern C and C++ standards.

If you’d like to explore the full set of updates, take a look at the What’s New in ReSharper C++ 2026.1 page. In this post, we’ll walk through the most important changes and improvements.

Download ReSharper C++ 2026.1

Faster performance for large C++ projects

ReSharper C++ 2026.1 brings significant performance optimizations across all stages of the IDE experience, specifically tuned for the demands of large-scale Unreal Engine projects.

In our measurements on the Lyra sample project:

  • Initial C++ indexing is up to 20% faster
  • Warm startup is more than 20% faster
  • Backend memory usage after warm start is reduced by up to 21%

These improvements reduce the time it takes to open projects and make returning to them between sessions faster and more predictable.

Modern C and C++ language support

ReSharper C++ continues to evolve alongside the latest language standards.

This release adds support for:

  • C++26/C23 #embed directive
  • C2Y _Countof operator
  • C++23 extended floating-point types (bfloat16_t, float16_t, float128_t)

We’ve also improved compatibility with compiler-specific extensions, including support for GCC nested functions and Clang nullability qualifiers.

Coding assistance improvements

This release introduces several updates that reduce friction when writing and navigating code.

Auto-import for C++20 modules

ReSharper C++ can now automatically insert missing import declarations when you use symbols exported from C++20 modules, helping you avoid manual fixes.

Expanded postfix completion

Postfix completion now works in more scenarios:

  • Primitive types like int, bool, and float
  • Literals (e.g., 42.cos)
  • User-defined literal suffixes

Support for Unreal Engine development

ReSharper C++ 2026.1 improves Blueprint integration and includes compatibility fixes for the upcoming Unreal Engine 5.8.

Better Blueprint support

  • Code Vision now recognizes BlueprintPure functions
  • Event implementations in Blueprints are detected more accurately
  • Find Usages discovers delegate bindings in Blueprint assets
  • Find Usages for references in Blueprints now uses asset paths for more precise results

Plugin indexing and UE 5.8 compatibility

ReSharper C++ now indexes Unreal Engine plugins by default, improving code analysis and navigation out of the box.

We’ve also added compatibility fixes for Unreal Engine 5.8, supporting the upcoming changes to UnrealHeaderTool.

Code analysis updates

This release introduces new inspections to help catch subtle issues earlier:

  • Detecting and fixing out-of-order C++20 designated initializers
  • Warning about mismatched access level of overriding functions
  • Extending unused symbol analysis to class members in .cpp files

We’ve also updated the bundled Clang-Tidy, bringing the checks and improvements from the latest LLVM 22 release.

Improved navigation

Navigation has been refined to make working with complex codebases easier:

  • Tooltips for gutter icons now include semantic highlighting
  • New gutter icons for navigation to base classes complement existing icons for derived classes
  • Go to Declaration and other navigation actions can be invoked on the opening brace in brace initialization expressions

Editor UI improvements

ReSharper C++ 2026.1 restores tooltip support in Visual Studio 2026.

We’ve also updated editor UI elements, including completion lists, tooltips, and popups, for a cleaner and more consistent experience. These elements now scale properly across DPI settings and zoom levels.

Tell us your thoughts

This post covered the highlights, but if you want the full details on these improvements, head on over to the What’s New in ReSharper C++ 2026.1 page.

Download the latest release and let us know how it works for your projects.

Download ReSharper C++ 2026.1