Amazon Bedrock Knowledge Bases: Anatomy and the Confluence-Shaped Question
A platform-engineer read of what a Bedrock Knowledge Base actually is, which data sources and vector stores are first-class, and why the console default rarely fits a small corpus.
The default behind the console wizard
Amazon Bedrock Knowledge Bases packages a connector, a chunker, an embedding model, a vector store, and a retrieval API into one managed surface. The Confluence connector is a vector-store decision in disguise: AWS documents that Confluence Cloud is "currently" only available with the OpenSearch Serverless vector store, and the same lock-in applies to SharePoint and Salesforce. That single constraint splits the architecture into two paths, and choosing before you wire the connector saves a rewrite.
This is an evaluation read for someone choosing infrastructure, not a deployment trial report. Numbers below are quoted from current AWS pages with links inline; treat them as accurate as of May 2026 and re-verify on the day you commit.
What a Bedrock Knowledge Base actually is
A KB has five components. AWS owns the middle three; you own the ends.
- Data source connector. You configure it, AWS runs the ingestion job. Options: S3, Confluence, SharePoint, Salesforce, Web Crawler, Custom (GA December 2024), and a separate "structured" shape for Redshift, Glue, and S3 Tables. See the connector index.
- Chunking and parsing. AWS runs it. Strategies include fixed-size, semantic, and hierarchical; Bedrock Data Automation handles layout-aware parsing of complex PDFs.
- Embedding model. AWS runs it, you pay per token. Titan Text Embeddings V2 and Cohere Embed v3 are the in-region defaults. The Bedrock pricing page is the source of truth.
- Vector store. You own the database; OpenSearch Serverless is the only "create-for-me" option in the console. Aurora PostgreSQL with pgvector, OpenSearch Managed Cluster, Pinecone, MongoDB Atlas, Redis Enterprise Cloud, Neptune Analytics, and S3 Vectors are the supported alternatives per the prerequisites page.
- Retrieval API. AWS runs it.
Retrieve returns ranked chunks; RetrieveAndGenerate chains a model invocation onto the same call. You pay for the embedding of the query, the vector store lookup, and (optionally) the generation model.
An ingestion job is a connector run that walks the source, chunks each document, embeds each chunk, and upserts into the vector store. A Retrieve call embeds the query once and hits the store. That is the entire data plane.
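The chained call can be sketched in boto3. This is a hedged sketch, not the canonical invocation: the knowledge base id, query, and model ARN are placeholders, and the client is injected so the shape is checkable without an AWS account; verify field names against the current boto3 `bedrock-agent-runtime` reference.

```python
# Sketch of the chained call, assuming an ingestion job has already
# completed. Ids and the model ARN are placeholders.

def retrieve_and_generate(client, kb_id: str, query: str, model_arn: str) -> str:
    """RetrieveAndGenerate: one query embedding, one store lookup,
    plus the chained generation invocation you also pay for."""
    resp = client.retrieve_and_generate(
        input={"text": query},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,  # generation model, billed per invocation
            },
        },
    )
    return resp["output"]["text"]
```

In production the client is `boto3.client("bedrock-agent-runtime")`; passing it in as a parameter keeps the request shape inspectable offline.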
Data sources
The connector list keeps growing, but the differences worth holding in your head are auth surface, "is this cloud-only", and what gets ingested.
Connectors marked preview can change. The "OpenSearch Serverless only" lock-in for Confluence, SharePoint, and Salesforce is documented on the Bedrock Confluence connector page and mirrored on the SharePoint and Salesforce pages.
Two things to underline. First, the Custom connector closed the most common "we have streaming data" gap in the original launch surface; treat it as the right answer when your source is an event stream, not a wiki. Second, structured KBs share the console wizard with vector KBs but they generate SQL against a tabular store. If your data is rows, do not bend an embedding pipeline around it.
Sources not on the list (GitHub repo, Hugging Face dataset)
A common ask is "can I index this GitHub repository or this Hugging Face dataset as a knowledge base." Neither has a first-class Bedrock connector. Two patterns work and a third is a trap.
The S3 indirection. Sync the repo or dataset to an S3 bucket on a schedule. For GitHub, a small Lambda triggered on push or a daily git clone --depth=1 into a versioned prefix is enough; for Hugging Face, huggingface_hub.snapshot_download(repo_id="org/name", repo_type="dataset", local_dir=...) followed by aws s3 sync is the canonical shape. Point an S3 data source at the bucket. You keep S3-shaped pricing, Aurora pgvector stays selectable, and the sync cadence is yours to control. This is the right default when the corpus is read-mostly.
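The Hugging Face half of that sync can be sketched as a small function. Repo, bucket, and prefix names are placeholders, and the download and upload callables are injected so the walk-and-upload logic stands on its own; in production you would pass `huggingface_hub.snapshot_download` and `boto3.client("s3").upload_file`.

```python
from pathlib import Path

def sync_dataset_to_s3(repo_id: str, bucket: str, prefix: str,
                       download, upload) -> int:
    """Snapshot a dataset locally, then mirror every file under an S3 prefix.
    `download(repo_id=..., repo_type="dataset")` must return the local dir;
    `upload(src_path, bucket, key)` matches boto3's upload_file signature."""
    local_dir = download(repo_id=repo_id, repo_type="dataset")
    count = 0
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(local_dir)}"
            upload(str(path), bucket, key)
            count += 1
    return count
```

Point the KB's S3 data source at the bucket afterwards; the sync cadence (cron, EventBridge schedule) stays yours.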
The Custom data source. Use the Custom connector and push documents via the KnowledgeBaseDocuments ingestion API. This path makes sense when you want to control chunking per document, attach metadata that the S3 connector would not pick up (file paths in a monorepo, dataset card sections, commit SHAs), or when the source updates often enough that batch syncs feel wasteful. The trade is that your code owns freshness and deletion semantics.
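A push through the Custom connector looks roughly like the sketch below. The payload nesting is my reading of the IngestKnowledgeBaseDocuments API as of the Custom connector GA; every field name here should be diffed against the current boto3 `bedrock-agent` documentation before use, and all ids are placeholders.

```python
# Hedged sketch: push one inline text document into a Custom data source,
# attaching the per-document metadata (path, commit SHA) that an S3 sync
# would not carry. Verify the document payload shape against current docs.

def ingest_text_document(client, kb_id: str, ds_id: str,
                         doc_id: str, text: str, metadata: dict):
    return client.ingest_knowledge_base_documents(
        knowledgeBaseId=kb_id,
        dataSourceId=ds_id,
        documents=[{
            "content": {
                "dataSourceType": "CUSTOM",
                "custom": {
                    "customDocumentIdentifier": {"id": doc_id},
                    "sourceType": "IN_LINE",
                    "inlineContent": {
                        "type": "TEXT",
                        "textContent": {"data": text},
                    },
                },
            },
            "metadata": {
                "type": "IN_LINE_ATTRIBUTE",
                "inlineAttributes": [
                    {"key": k, "value": {"type": "STRING", "stringValue": v}}
                    for k, v in metadata.items()
                ],
            },
        }],
    )
```

Deletion is the part this sketch does not show: your code must also call the corresponding delete API when a source document disappears, or stale chunks stay retrievable.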
The Web Crawler trap. It is tempting to point Bedrock's Web Crawler at github.com/org/repo or huggingface.co/datasets/org/name. Avoid it. Both platforms rate-limit anonymous crawlers aggressively, the rendered HTML is wrapped in app shell that becomes noise after chunking, and the connector does not authenticate per source, so private repos and gated datasets are out of reach entirely.
Vector store choices
The decision below branches first on data source type, because that branch is the one AWS makes for you. Confluence, SharePoint, and Salesforce connectors are only available with OpenSearch Serverless today; S3, Custom, and Web Crawler corpora can sit on any of the supported stores.
Why Aurora pgvector as the default for S3-only corpora. AWS prescriptive guidance for Bedrock KB lists Aurora pgvector as a first-class store. Aurora Serverless v2 can scale down to a small minimum ACU (see the Aurora Serverless v2 docs for current floors), the index is portable, and your team probably already operates Postgres. The trade is that you own the index lifecycle: ANALYZE, vacuum, ANN parameter choice on HNSW or IVFFlat.
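That index lifecycle can be made concrete as the SQL you would schedule. Table and column names below are illustrative, not the exact schema the KB prerequisites page mandates, and the HNSW parameters are pgvector defaults worth tuning, not recommendations.

```python
# The maintenance you own on Aurora pgvector, as SQL strings you might run
# via psycopg on a schedule. Names are placeholders; the Bedrock
# prerequisites page specifies the actual required table and index shape.

HNSW_INDEX = """
CREATE INDEX IF NOT EXISTS kb_chunks_embedding_hnsw
ON kb_chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""

MAINTENANCE = """
VACUUM (ANALYZE) kb_chunks;  -- refresh planner stats after a large sync
"""
```

The point of the sketch is ownership: nothing in the managed KB surface runs these for you on a pgvector store.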
OpenSearch Serverless is the right pick when retrieval latency is the bottleneck, you already operate an OpenSearch collection for other workloads, or you need hybrid search and k-NN tuning. The cost shape is the catch. The OpenSearch Serverless capacity page documents an OCU-based pricing model with a minimum baseline footprint. Quote the current OCU minimums from that page when you size, because the floor has been moving downward.
S3 Vectors is the newest entry. The S3 Vectors with Bedrock KB page describes the integration and the cost story, but the feature is young enough that latency parity is worth measuring before you pick it for a hot path.
Confluence connector close-up
Confluence is the single most-requested corpus on internal RAG projects, so the connector deserves its own section.
The vector-store lock-in. The Bedrock Confluence connector page states verbatim: "Amazon Bedrock supports connecting to Confluence Cloud instances. Currently, only Amazon OpenSearch Serverless vector store is available to use with this data source." The same constraint is documented for the SharePoint and Salesforce connectors. Attaching any of these to a KB forces the OpenSearch Serverless baseline footprint into your bill. Aurora pgvector is not selectable while Confluence is wired up.
Preview status. The same page notes: "Confluence data source connector is in preview release and is subject to change." Treat configuration fields, sync semantics, and the OpenSearch Serverless lock-in itself as moving targets, and re-read the page before any production cutover.
The hard endpoint constraint. The connector page names Confluence Cloud as the supported product. Confluence Server, Confluence Data Center, and any self-hosted instance behind a custom domain are not on the menu. Verify the corpus hostname ends in .atlassian.net before you commit to a demo.
Auth. Two modes: Basic with an admin email and an API token stored in AWS Secrets Manager, or OAuth 2.0 with a registered Atlassian app and refresh-token flow. Atlassian deprecated basic auth with passwords (not API tokens) and is steering REST consumers toward OAuth 2.0; the Bedrock connector's BASIC mode uses API tokens and is still supported, though token rotation policies have tightened.
What gets ingested. Pages, blog posts, comments, and attachments. No image, table, chart, or diagram content is extracted; the connector page is explicit that "Confluence data sources don't support multimodal data, such as tables, charts, diagrams, or other images." Attachment handling depends on the parsing strategy you configure on the KB. Bedrock Data Automation extracts text from complex PDFs; the default parser handles plain text and common formats. Very large attachments may be silently truncated by the parser. Worth flagging in your evaluation.
ACL behaviour. This is the most-misread feature of the entire surface. Confluence space and page restrictions are not enforced per user at retrieval time. Any principal with bedrock:Retrieve on the knowledge base sees every chunk in the index. If you need per-user filtering, your options are: restrict ingestion to spaces that match a single access tier, split the corpus across multiple KBs by tier, or apply metadata filtering on Retrieve calls. None of these are the same as ACL passthrough.
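The metadata-filtering option looks like this on a Retrieve call. The filter key and tier value are placeholders you would populate at ingestion time; note that this narrows results for a well-behaved caller, it does not authenticate anyone against Confluence.

```python
# Metadata filtering on Retrieve: the closest thing the surface offers to
# tiers, and not ACL passthrough. The "access_tier" key is an assumption
# about metadata you attached at ingestion, not a built-in field.

def retrieve_for_tier(client, kb_id: str, query: str, tier: str):
    resp = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 5,
                "filter": {"equals": {"key": "access_tier", "value": tier}},
            }
        },
    )
    return resp["retrievalResults"]
```

Any principal with bedrock:Retrieve can simply omit the filter, which is why this is tier hygiene, not enforcement.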
Sync. Incremental sync runs on update timestamps. Deletions in Confluence propagate but the cadence is not instantaneous. Check the connector docs for the supported sync modes.
Common gotchas to budget for. Atlassian per-app rate limits during the first full sync of a large space, mixed-space ingestion blowing up your token budget, and rate-limit hits causing partial sync states that you have to reconcile.
Sketch: a Confluence data source
The shape of the call is what matters. The AWS CLI reference for create-data-source and the Confluence configuration schema document the full field set.
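A boto3 sketch of that shape, under loud caveats: the host URL, secret ARN, and ids are placeholders, and because the connector is in preview, the configuration schema should be diffed against the create-data-source reference on the day you run it.

```python
# Hedged sketch of wiring a Confluence Cloud data source. Field names
# follow my reading of the current schema; the connector is in preview,
# so verify before use.

def create_confluence_data_source(client, kb_id: str,
                                  host_url: str, secret_arn: str):
    return client.create_data_source(
        knowledgeBaseId=kb_id,
        name="confluence-wiki",
        dataSourceConfiguration={
            "type": "CONFLUENCE",
            "confluenceConfiguration": {
                "sourceConfiguration": {
                    "hostUrl": host_url,          # must be a *.atlassian.net host
                    "hostType": "SAAS",           # Cloud is the only supported product
                    "authType": "BASIC",          # admin email + API token in the secret
                    "credentialsSecretArn": secret_arn,
                },
            },
        },
    )
```

The client here is `boto3.client("bedrock-agent")`; injecting it keeps the request shape checkable offline.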
A retrieval call once the ingestion job finishes:
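A hedged sketch of that call, with the knowledge base id and query as placeholders; check parameter names against the retrieve CLI reference before relying on them.

```python
# Retrieve with the search-type override spelled out. overrideSearchType
# only matters on stores that support hybrid search; ids are placeholders.

def retrieve_hybrid(client, kb_id: str, query: str):
    resp = client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {
                "numberOfResults": 8,
                "overrideSearchType": "HYBRID",  # SEMANTIC is the default
            }
        },
    )
    return resp["retrievalResults"]
```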
HYBRID search is available here because Confluence forces OpenSearch Serverless; on a pgvector-backed S3 KB you would use SEMANTIC, which is also the default. See the retrieve CLI reference for the full set of overrides.
Both calls return JSON; neither prints a success line. create-data-source returns the dataSourceId and a CREATING status, and retrieve returns a retrievalResults array. Plan your scripts around that shape.
Cost shape
Three buckets, each priced separately. All numbers below are quoted from the Bedrock pricing page and the OpenSearch Serverless pricing page as of May 2026. Re-verify on the day you commit.
Ingestion. You pay for embedding tokens at the embedding model's per-token rate. Titan Text Embeddings V2 is listed on the Bedrock pricing page; multiply your corpus token count by the listed rate. If you turn on Bedrock Data Automation for parsing, you also pay a per-page processing fee documented on the same page.
Query. Every Retrieve call embeds the query (a small token spend) and reads the vector store. If you use RetrieveAndGenerate, you additionally pay for the generation model invocation, which is usually the dominant per-query cost.
Vector store baseline. This is the bucket that catches teams. OpenSearch Serverless bills per OCU-hour with a minimum baseline footprint regardless of traffic. Aurora Serverless v2 with pgvector bills per ACU-hour and can scale to a small floor when idle. The OpenSearch baseline can dominate a small-corpus bill before the first query lands.
Worked math for a hypothetical 50,000-document corpus and 100 queries per day:
- Ingestion embeddings: 50,000 documents at roughly 3,000 tokens each is 150 million tokens. Multiply by the Titan V2 per-1K rate from the Bedrock pricing page. The result is a one-time spend, not a recurring one.
- Query embeddings: 100 queries per day × 30 days × ~200 tokens is 600,000 tokens per month at the same rate. Trivial.
- Generation (if RetrieveAndGenerate): 3,000 queries per month × your chosen model's per-1K output rate. Usually the biggest line item once you turn generation on.
- Vector store baseline: OpenSearch Serverless minimum OCU footprint × OCU-hour rate × 730 hours per month, versus Aurora Serverless v2 minimum ACU × ACU-hour rate × 730. For corpora this small, the Aurora floor is materially lower.
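The bucket arithmetic above can be captured as a function with every rate left as a parameter, because the real numbers live on the pricing pages, not here; the 200-token query estimate is the same assumption as in the prose.

```python
# Cost-shape sketch: all rates are inputs, to be filled from the Bedrock
# and vector-store pricing pages on the day you commit.

def kb_cost_estimate(docs: int, tokens_per_doc: int, queries_per_day: int,
                     embed_rate_per_1k: float, gen_rate_per_query: float,
                     store_units: float, store_rate_per_unit_hour: float) -> dict:
    ingest_once = docs * tokens_per_doc / 1000 * embed_rate_per_1k
    query_embed = queries_per_day * 30 * 200 / 1000 * embed_rate_per_1k
    generation = queries_per_day * 30 * gen_rate_per_query
    store_baseline = store_units * store_rate_per_unit_hour * 730
    return {
        "one_time_ingest": ingest_once,
        "monthly": query_embed + generation + store_baseline,
        "monthly_store_baseline": store_baseline,
    }
```

Plugging in any plausible rates shows the shape the prose describes: the store baseline and generation dominate, query embeddings are rounding error.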
If you remember one thing about Bedrock KB pricing, remember that the vector store baseline is what separates a hobby-scale bill from a production-scale bill. Embedding tokens are usually rounding error.
When not to use Bedrock KB
Bedrock KB is a fit when your corpus is unstructured, your sources are S3 or one of the cloud SaaS connectors, and you do not need per-user ACL at retrieval. Cases where it is the wrong tool:
- Confluence Server or Data Center. Self-hosted Confluence is not supported. Use Amazon Kendra's Confluence connector, which does support both Cloud and Server, or roll your own ingestion. Kendra connectors documentation covers the surface.
- Per-user ACL at query time. If your retrieval results must respect each user's source-system permissions, you need either Kendra (which has native ACL passthrough on several connectors) or a hand-rolled pipeline that resolves user permissions at query time.
- Hybrid graph plus vector. Knowledge that is meaningfully graph-shaped is a fit for Neptune Analytics or a dedicated graph database with vector support.
- Very small corpora (under ~1,000 documents). Stuffing the whole corpus into a model context or running a single-process FAISS index is often simpler and cheaper.
- Very large corpora (over ~10 million chunks). A dedicated vector database, an OpenSearch Managed Cluster, or a custom pipeline gives you tuning levers the KB managed surface does not expose.
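The very-small-corpus point deserves a concrete shape: below roughly a thousand documents, brute force in numpy is often the entire retrieval system. A sketch, assuming embeddings are already computed and held in a single array.

```python
import numpy as np

def top_k(query_vec: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k rows of `corpus` most cosine-similar to
    `query_vec`. No index, no service, no baseline bill."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity per row
    return np.argsort(scores)[::-1][:k]  # highest scores first
```

When this is enough, the whole Bedrock KB surface is overhead; when it stops being enough, you have outgrown the bullet, not the technique.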
Roll-your-own (S3 plus a Lambda chunker plus pgvector or OpenSearch plus a small retrieval service) is also a valid choice. You give up the managed connector and chunking surface; you gain full control over embedding model swaps, parsing strategy, sync semantics, and cost tracking. For a corpus you intend to evolve aggressively, that control is worth more than the managed convenience.
Common pitfalls
- Treating Confluence space permissions as enforced at retrieval. They are not.
- Picking OpenSearch Serverless for a small POC and being surprised by the baseline bill.
- Conflating structured KBs (text-to-SQL) with vector KBs in the same architecture diagram. They share a console wizard, not a runtime.
- Assuming the Confluence connector works for self-hosted Confluence. It does not.
- Forgetting that Custom data sources are push-only; your code owns freshness.
- Using RetrieveAndGenerate when Retrieve would do. RetrieveAndGenerate adds a model invocation to every call; when you only need chunks, Retrieve is the cheaper half of the pair.
Closing
The recommendation lands in three boundaries. If your corpus is S3-only and under roughly a million documents, use Bedrock KB on Aurora PostgreSQL with pgvector; you keep a portable index and avoid the OpenSearch Serverless baseline footprint. If your corpus must include Confluence, SharePoint, or Salesforce, budget the OpenSearch Serverless baseline as a fixed cost of admission and choose between Bedrock KB and Kendra on ACL and multimodal needs. Outside both envelopes (Confluence Server, per-user ACL at retrieval, graph-shaped data, very large corpora), roll your own pipeline against a vector store you control.
References
- Retrieve data and generate AI responses with Amazon Bedrock Knowledge Bases - Canonical AWS documentation for what a Bedrock Knowledge Base is and how it is composed
- Connect a data source to your knowledge base - Full matrix of supported connectors with per-connector configuration
- Connect to Confluence for your knowledge base - Cloud-only constraint and the supported auth modes for the Confluence connector
- Connect your knowledge base to a custom data source - Documentation for the Custom connector that went GA in December 2024
- Prerequisites for using a vector store you created for a knowledge base - Supported vector stores and required index settings
- Using S3 Vectors with Amazon Bedrock Knowledge Bases - Newest vector store integration; verify currency at publish time
- Managing capacity limits for Amazon OpenSearch Serverless - OCU floors and redundancy semantics that drive the baseline cost
- Amazon Bedrock Pricing - Source of truth for embedding and generation model pricing
- RAG fully managed with Amazon Bedrock (AWS Prescriptive Guidance) - Official decision-framing document for Bedrock-based RAG
- Amazon Bedrock Knowledge Bases now supports custom connectors and ingestion of streaming data - GA announcement for the Custom connector
- Amazon Bedrock Knowledge Bases now supports Amazon OpenSearch Managed Cluster for vector storage - Added vector store option
- Dive deep into vector data stores using Amazon Bedrock Knowledge Bases - AWS-authored comparison of the supported vector stores
- Amazon Kendra Confluence connector documentation - The native ACL alternative when Bedrock KB does not fit
- Atlassian deprecation notice for Basic auth with passwords on Confluence Cloud - Scope of the deprecation (passwords, not API tokens) and the OAuth 2.0 migration guidance