<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc>
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="2"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-ietf-nmop-network-anomaly-architecture-07"
     ipr="trust200902" consensus="true" submissionType="IETF">
  <front>
    <title abbrev="Network Anomaly Detection Framework">A Framework for
		a Network Anomaly Detection Architecture</title>

    <author fullname="Thomas Graf" initials="T." surname="Graf">
      <organization>Swisscom</organization>

      <address>
        <postal>
          <street>Binzring 17</street>

          <city>Zurich</city>

          <code>8045</code>

          <country>Switzerland</country>
        </postal>

        <email>thomas.graf@swisscom.com</email>
      </address>
    </author>

    <author fullname="Wanting Du" initials="W." surname="Du">
      <organization>Swisscom</organization>

      <address>
        <postal>
          <street>Binzring 17</street>

          <city>Zurich</city>

          <code>8045</code>

          <country>Switzerland</country>
        </postal>

        <email>wanting.du@swisscom.com</email>
      </address>
    </author>

    <author fullname="Pierre Francois" initials="P." surname="Francois">
      <organization>INSA-Lyon</organization>

      <address>
        <postal>
          <street/>

          <city>Lyon</city>

          <region/>

          <code/>

          <country>France</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>pierre.francois@insa-lyon.fr</email>

        <uri/>
      </address>
    </author>

    <author fullname="Alex Huang Feng" initials="A." surname="Huang-Feng">
      <organization>INSA-Lyon</organization>

      <address>
        <postal>
          <street/>

          <city>Lyon</city>

          <region/>

          <code/>

          <country>France</country>
        </postal>

        <phone/>

        <facsimile/>

        <email>alex.huang-feng@insa-lyon.fr</email>

        <uri/>
      </address>
    </author>

    <date day="18" month="January" year="2026"/>

    <area>Operations and Management</area>

    <workgroup>NMOP</workgroup>

    <abstract>
      <t>This document describes the motivation and architecture of a
			Network Anomaly Detection Framework and the relationship to other
			documents describing network Symptom semantics and network
			incident lifecycle.</t>

      <t>The described architecture for detecting IP network service
      interruptions is designed to be generally applicable and
      extensible. Different applications are described and examples are
      referenced with open-source running code.</t>
    </abstract>

    <note removeInRFC="true">
      <name>Discussion Venues</name>

      <t>Discussion of this document takes place on the NMOP Working
      Group mailing list (nmop@ietf.org), which is archived at <eref
      target="https://mailarchive.ietf.org/arch/browse/nmop/"/>.</t>

      <t>Source for this draft and an issue tracker can be found at <eref
      target="https://github.com/ietf-wg-nmop/draft-ietf-nmop-network-anomaly-architecture/"/>
      .</t>
    </note>
  </front>

  <middle>
    <section anchor="Introduction" title="Introduction">
      <t>Today's highly virtualized large-scale IP networks are a
      challenge for network operators to monitor due to their vast
      number of dependencies. Humans are no longer capable of manually
      verifying all these dependencies end to end in a timely
      manner.</t>

      <t>IP networks are the backbone of today's society. We
      individually depend on networks fulfilling the purpose of
      forwarding IP packets from a point A to a point B at any time of
      the day. A loss of such connectivity, even for a short period of
      time, today has manifold implications that can range from minor
      to severe. An interruption can lead to being unable to browse the
      web, watch a soccer game, access the company intranet or, even in
      life-threatening situations, no longer being able to reach
      emergency services. Further, congestion in the network leading to
      delayed packet forwarding can have severe repercussions on
      real-time applications.</t>

      <t>Networks are generally deterministic. The usage of networks,
      however, is only somewhat so. Humans, as a large group of people,
      are to some degree predictable: there are time-of-day patterns in
      terms of when we eat, sleep, work or enjoy leisure, and these
      patterns potentially change depending on age, profession and
      cultural background.</t>

      <section anchor="Motivation" title="Motivation">
        <t>When operational or configurational changes in connectivity
        services happen, it is crucial for network operators to detect
        interruptions within the network faster than the users of the
        connectivity services do.</t>

        <t>In order to achieve this objective, automation in network
        monitoring is required. The people operating the network are
        today simply outnumbered by the people utilizing connectivity
        services.</t>

        <t>This automation needs to monitor network changes holistically
        by supervising all three network planes simultaneously for a
        given connectivity service on OSI (Open Systems
        Interconnection) layer 3. The monitoring system needs to
        distinguish configurational from operational State changes,
        e.g. an interface shut down by an operator versus an interface
        State going down due to loss of signal on the optical layer,
        and to determine whether the change disrupted the service, e.g.
        whether the packets received from customers are no longer
        forwarded to the desired destination.</t>
				
        <t>The management plane relates to network node entities. The
        control plane in turn propagates a subset of the management
        plane entities, the path reachability, to its neighboring
        network nodes across the network. The forwarding plane requires
        a previously converged network topology and received packets to
        export metrics.</t>
				
        <t>Related State changes in the control and management planes
        indicate a network topology State change, while a State change
        in the forwarding plane describes how packets are being
        forwarded. In other words, control and management plane State
        changes can be attributed to network topology State changes,
        whereas forwarding plane State changes are related to the
        outcome of these network topology State changes.</t>

        <t>Changes in networks happen all the time due to the vast
        number of dependencies, and most of them do not negatively
        affect end-to-end connectivity thanks to redundancies in the
        network. A scoring system is therefore needed to indicate how
        disruptive a change is considered to be. The scoring system
        needs to take into account the number of transport sessions,
        the number of affected flows and whether the detected
        interruptions are usual or exceptional.</t>
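        <t>As a non-normative illustration, such a scoring system could
        combine these inputs as follows; the weights, the reduction
        factor for usual interruptions and the function name are
        illustrative assumptions, not part of this framework:</t>

        <figure>
          <artwork><![CDATA[
```python
def disruption_score(affected_sessions: int, total_sessions: int,
                     affected_flows: int, total_flows: int,
                     is_exceptional: bool) -> float:
    """Illustrative scoring: 0 (no impact) to 100 (severe impact)."""
    session_ratio = affected_sessions / max(total_sessions, 1)
    flow_ratio = affected_flows / max(total_flows, 1)
    base = 100 * (0.6 * session_ratio + 0.4 * flow_ratio)
    # Usual (expected) interruptions are weighted lower than
    # exceptional ones.
    return round(base if is_exceptional else base * 0.5, 1)

print(disruption_score(800, 1000, 400, 1000, True))   # large, unusual
print(disruption_score(10, 1000, 5, 1000, False))     # small, usual
```
]]></artwork>
        </figure>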
      </section>

      <section anchor="Scope" title="Scope">
        <t>Such objectives can be achieved by applying checks on
        network-modeled time series data that contains semantics
        describing its dependencies across network planes. These checks
        can be based on domain knowledge or on outlier detection
        techniques. Domain-knowledge-based techniques apply the
        expertise of network engineers operating a network to
        understand whether there is an issue impacting the customer or
        not. On the other hand, outlier detection techniques identify
        measurements that deviate significantly from the norm and are
        therefore considered anomalous.</t>

        <t>The described scope does not take the connectivity service
        intent into account, nor does it verify whether the intent is
        being achieved at all times. Changes to the service intent that
        cause service disruptions are therefore reported as service
        disruptions; a monitoring system that takes the intent into
        account would consider them intended.</t>

        <t>Also out of scope of this document is the gradual
        degradation of a connectivity service over a long period of
        time. An example would be optical fiber degradation, which
        leads to malformed packets on the IP layer and therefore
        steadily increases packet drops. Outlier detection techniques
        can be applied here as well, but instead of the network model,
        the component type and characteristics would be taken into
        context.</t>
      </section>
    </section>

    <section anchor="Conventions_and_Definitions"
             title="Conventions and Definitions">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
			"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",
			"NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to
			be interpreted as described in BCP 14 <xref target="RFC2119"/>
			<xref target="RFC8174"/> when, and only when, they appear in all
			capitals, as shown here.</t>

      <section anchor="Terminology" title="Terminology">
        <t>This document defines the following terms:</t>

        <t>Outlier Detection: A systematic approach to identify rare
        data points deviating significantly from the majority.</t>

        <t>Service Disruption Detection (SDD): The process of detecting
				a service degradation by discovering outliers in network
				monitoring data.</t>

        <t>Service Disruption Detection System (SDDS): A system that
        performs SDD.</t>

        <t>Rules: Refers to rules defined by domain experts or
        artificial intelligence in the context of detection strategies.
        See <xref target="Expert_Rules"/> for details on domain expert
        rules.</t>

        <t>Additionally, this document makes use of the terms defined
        in <xref target="I-D.ietf-nmop-terminology"/>, <xref
        target="I-D.ietf-nmop-network-anomaly-lifecycle"/> and
        <xref target="RFC8969"/>.</t>

        <t>The following terms are used as defined in <xref
        target="I-D.ietf-nmop-terminology"/> :</t>

        <t><list style="symbols">
            <t>Resource</t>

            <t>Event</t>

            <t>State</t>

            <t>Relevance</t>

            <t>Problem</t>

            <t>Symptom</t>

            <t>Alarm</t>
          </list></t>

        <t>Figure 2 in <xref section="3" sectionFormat="of"
        target="I-D.ietf-nmop-terminology"/> shows characteristics of
				observed operational network telemetry metrics.</t>

        <t>Figure 4 in <xref section="3" sectionFormat="of"
        target="I-D.ietf-nmop-terminology"/> shows relationships
        between state, relevant state, problem, symptom, cause and
        alarm.</t>

        <t>Figure 5 in <xref section="3" sectionFormat="of"
        target="I-D.ietf-nmop-terminology"/> shows relationships between
        problem, symptom and cause.</t>

        <t>The following terms are used as defined in <xref
        target="I-D.ietf-nmop-network-anomaly-lifecycle"/> :</t>

        <t><list style="symbols">
            <t>False Positive</t>

            <t>False Negative</t>

            <t>Confidence Score</t>

            <t>Concern Score</t>
          </list></t>

        <t>The following terms are used as defined in <xref
				target="RFC8969"/> :</t>

        <t><list style="symbols">
            <t>Service Model</t>
          </list></t>

      </section>

      <section anchor="Outlier_Detection" title="Outlier Detection">
        <t>Outlier Detection, also known as anomaly detection,
        describes a systematic approach to identify rare data points
        deviating significantly from the majority. Outliers can
        manifest as a single data point or as a sequence of data
        points. There are in general multiple ways to classify
        anomalies, but in the context of this document, the following
        three classes are taken into account:</t>

        <dl>
          <dt>Global outliers:</dt>

          <dd>An outlier is considered "global" if its behavior is
					outside the entirety of the considered data set. For example,
					if the average dropped packet count is between 0 and 10 per
					minute and, in a small time-window, the value gets to 1000,
					this data point is considered a global anomaly.</dd>
        </dl>

        <dl>
          <dt>Contextual outliers:</dt>

          <dd>An outlier is considered "contextual" if its behavior is
					within a normal (expected) range, but it would not be expected
					based on some context. Context can be defined as a function of
					multiple parameters, such as time, location, etc. An example
					of a contextual outlier is when the forwarded packet volume
					overnight reaches levels which might be totally normal for the
					daytime, but anomalous and unexpected for the nighttime.</dd>
        </dl>

        <dl>
          <dt>Collective outliers:</dt>

          <dd>An outlier is considered "collective" if the behavior of
          each single data point that is part of the anomaly is within
          expected ranges (so they are not anomalous in either a
          contextual or a global sense), but the group, taking all the
          data points together, is. Note that the group can be made
          within a single time series (a sequence of data points is
          anomalous) or across multiple types of metrics (e.g. if
          looking at two metrics together, the combined behavior turns
          out to be anomalous). An example of a collective outlier is
          when network path and interface State changes and a dropped
          packet spike match the network node, routing instance, VPN
          next-hop and the time range where these changes occurred. A
          network model is leveraged to establish the relationships
          between the operational Network Telemetry time series
          data.</dd>
        </dl>

        <t>For each outlier a Confidence Score between 0 and 100 is
        calculated: the higher the Confidence Score value, the higher
        the probability that the observed data point is an outlier.
        Likewise, for each strategy a Confidence and a Concern Score
        between 0 and 100 are calculated. The higher the Confidence
        Score value, the higher the probability that the strategy
        detected an outlier; the higher the Concern Score value, the
        higher the probability that the strategy detected a potential
        connectivity service degradation. <xref target="VAP09">Anomaly
        detection: A survey</xref> provides an overview of different
        anomaly detection techniques and discusses the outlier
        detection approach adopted by each.</t>
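        <t>As a non-normative sketch, a global outlier check could
        derive a Confidence Score from how many standard deviations a
        data point lies away from its history; the mapping and the
        saturation at three standard deviations are illustrative
        assumptions:</t>

        <figure>
          <artwork><![CDATA[
```python
import statistics

def confidence_score(history: list[float], value: float) -> int:
    """Map how far a data point deviates from the historical mean
    (in standard deviations) onto a 0..100 Confidence Score."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1e-9  # guard constant series
    z = abs(value - mean) / stdev
    # Saturate: 3 or more standard deviations -> maximum confidence.
    return min(100, int(z / 3 * 100))

drops_per_minute = [2, 4, 3, 5, 1, 4, 2, 3, 5, 3]
print(confidence_score(drops_per_minute, 1000))  # global outlier
print(confidence_score(drops_per_minute, 4))     # within the norm
```
]]></artwork>
        </figure>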
      </section>

      <section anchor="Knowledge_Based_Detection"
               title="Knowledge Based Detection">
        <t>Knowledge-based anomaly detection, a superset of rule-based
        anomaly detection and a subset of semantic-based detection
        <xref target="ASNL25">Knowledge-based anomaly detection:
        Survey, challenges, and future directions</xref>, is a
        technique used to identify anomalies or outliers by comparing
        them against predefined rules or patterns. This approach relies
        on the use of domain-specific knowledge to set standards,
        thresholds, or rules for what is considered "normal" behavior.
        Traditionally, these rules are established manually by a
        knowledgeable network engineer. Looking forward, these rules
        can be expressed using human- and machine-readable, network
        protocol derived Symptoms and patterns defined in
        ontologies.</t>

        <t>Additionally, in the context of network anomaly detection,
				the knowledge-based approach works hand in hand with the
				deterministic understanding of the network, which is reflected
				in network modeling. Components are organized into three network
				planes: the Management Plane, the Control Plane, and the
				Forwarding Plane <xref target="RFC9232"/>. A component can
				relate to a physical, virtual, or configurational entity, or to
				a sum of packets belonging to a flow being forwarded in a
				network.</t>

        <t>Such relationships can be modelled in Service and
				Infrastructure Maps (SIMAP) to automate that process. <xref
				target="I-D.ietf-nmop-simap-concept"/> defines the
				concepts for the SIMAP and <xref
				target="I-D.havel-nmop-digital-map"/> defines an application of
				the SIMAP to network topologies.</t>
		
        <t>These relationships can also be modeled in Knowledge Graphs
        <xref section="5" sectionFormat="of"
        target="I-D.mackey-nmop-kg-for-netops"/> using semantic triples
        <xref target="W3C-RDF-concept-triples"/>, where, thanks to the
        declarative form of ontologies, those semantic triples are both
        machine and human readable. See <xref
        target="Analyticsl_Observed_Symptoms"/> for an example of an
        ontology describing symptoms.</t>
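        <t>As a non-normative sketch, semantic triples and a simple
        transitive traversal over them can be expressed with plain data
        structures; the subjects, predicates and objects below are
        hypothetical, and a real deployment would use an RDF triple
        store:</t>

        <figure>
          <artwork><![CDATA[
```python
# Hypothetical cause-effect triples: (subject, predicate, object).
triples = {
    ("loss-of-signal", "causes", "interface-down"),
    ("interface-down", "causes", "bgp-session-down"),
    ("bgp-session-down", "causes", "vpn-route-withdrawn"),
}

def affected_by(subject: str) -> set[str]:
    """Transitively follow subject -> object edges to list everything
    a given network state change affects."""
    affected: set[str] = set()
    frontier = {subject}
    while frontier:
        frontier = {o for s, _, o in triples if s in frontier} - affected
        affected |= frontier
    return affected

print(sorted(affected_by("loss-of-signal")))
```
]]></artwork>
        </figure>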
      </section>

      <section anchor="Machine_Learning" title="Machine Learning">
        <t>Machine learning is commonly used for detecting outliers or
        anomalies. Typically, unsupervised learning is widely
        recognized for its applicability, given the inherent
        characteristics of network data; see <xref target="VAP09"/>.
        Although machine learning requires a sizeable amount of
        high-quality data and considerable advanced training, the
        advantages it offers make these requirements worthwhile. The
        power of this approach lies in its generalizability,
        robustness, ability to simplify the fine-tuning process and,
        most importantly, its capability to identify anomaly patterns
        that might go unnoticed by a human observer.</t>
      </section>

      <section anchor="Data_Mesh" title="Data Mesh">
        <t>The <xref target="Deh22">Data Mesh</xref> Architecture
        distinguishes between operational and analytical data.
        Operational data refers to data collected from operational
        systems, while analytical data refers to insights gained from
        operational data.</t>

        <section anchor="Operational_Network_Data"
                 title="Operational Network Data">
          <t>In terms of network observability, the semantics of
          operational network metrics are defined by the IETF and are
          categorized, as described in the Network Telemetry Framework
          <xref target="RFC9232"/>, into the following three network
          planes:</t>

          <dl>
            <dt>Management Plane:</dt>

            <dd>Time series data describing the State changes and
						statistics of a network node and its Resources. For example,
						Interface State and statistics modeled in
						ietf-interfaces.yang <xref target="RFC8343"/>.</dd>
          </dl>

          <dl>
            <dt>Control Plane:</dt>

            <dd>Time series data describing the State and State changes
						of network reachability. For example, BGP VPNv6 unicast
						updates and withdrawals exported in BGP Monitoring Protocol
						(BMP) <xref target="RFC7854"/> and modeled in BGP <xref
            target="RFC4364"/>.</dd>
          </dl>

          <dl>
            <dt>Forwarding Plane:</dt>

            <dd>Time series data describing the forwarding behavior of
						packets and its data-plane context. For example, dropped
						packet count modelled in IPFIX entity forwardingStatus(IE89)
						<xref target="RFC7270"/> and packetDeltaCount(IE2) <xref
            target="RFC5102"/> and exported with IPFIX <xref
            target="RFC7011"/>.</dd>
          </dl>
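          <t>As a non-normative illustration, metrics from the three
          planes can be normalized into a common time series record
          shape before further processing; the field and metric names
          below are illustrative assumptions and are not taken from any
          YANG module or IPFIX entity:</t>

          <figure>
            <artwork><![CDATA[
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricRecord:
    plane: str       # "management" | "control" | "forwarding"
    resource: str    # e.g. interface name, BGP peer, flow key
    metric: str      # e.g. "oper-status", "withdrawals", "drops"
    value: float
    timestamp: int   # seconds since epoch

records = [
    MetricRecord("management", "eth0", "oper-status", 0, 1700000000),
    MetricRecord("control", "peer-192.0.2.1", "withdrawals", 250,
                 1700000002),
    MetricRecord("forwarding", "vpn-a", "drops", 9000, 1700000003),
]

# Group by plane to feed plane-specific detection strategies.
by_plane = {}
for r in records:
    by_plane.setdefault(r.plane, []).append(r)
print(sorted(by_plane))
```
]]></artwork>
          </figure>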
        </section>

        <section anchor="Analyticsl_Observed_Symptoms"
                 title="Analytical Observed Symptoms">
          <t>The Service Disruption Detection process takes operational
          network data as input and generates analytical metrics
          describing Symptoms and outlier patterns of the connectivity
          service disruption.</t>

          <t>The observed Symptoms are categorized into semantic
          triples <xref target="W3C-RDF-concept-triples"/>: action,
          reason, trigger. The object is the action, describing the
          change in the network. The reason is the predicate, defining
          why this change occurred, and the subject is the trigger,
          which defines what triggered that change.</t>
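          <t>As a non-normative illustration (the Symptom values below
          are hypothetical and not taken from the semantics document),
          an observed Symptom could be decomposed into such a
          triple:</t>

          <figure>
            <artwork><![CDATA[
```python
# subject = trigger, predicate = reason, object = action
symptom = {
    "trigger": "link-failure",    # what triggered the change
    "reason": "loss-of-signal",   # why the change occurred
    "action": "bgp-withdraw",     # the observed change in the network
}

def as_triple(s: dict) -> tuple:
    """Order a Symptom as an RDF-style (subject, predicate, object)."""
    return (s["trigger"], s["reason"], s["action"])

print(as_triple(symptom))
```
]]></artwork>
          </figure>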

          <t>Symptom definitions are described in <xref section="3"
          sectionFormat="of"
          target="I-D.ietf-nmop-network-anomaly-semantics"/> and
          outlier pattern semantics in <xref section="8"
					sectionFormat="of"
          target="I-D.ietf-nmop-network-anomaly-lifecycle"/>. Both are
					expressed in YANG Service Models.</t>
		  
          <t>However, the semantics could also be expressed with the
          Semantic Web Technology Stack in RDF, RDFS and OWL
          definitions as described in <xref section="6"
          sectionFormat="of" target="I-D.mackey-nmop-kg-for-netops"/>.
          Together with the ontology definitions described in <xref
          section="3" sectionFormat="of"
          target="I-D.ietf-nmop-network-anomaly-semantics"/>, a
          Knowledge Graph can be created describing the relationship
          between the network state and the observed Symptom.</t>
        </section>
      </section>
    </section>

    <section anchor="Elements_of_the_Architecture"
             title="Elements of the Architecture">
      <t>The service disruption detection system architecture is aimed
      at detecting service disruptions and is built upon multiple
      components, for which design choices need to be made. In this
      section, we describe the main components of the architecture and
      delve into considerations to be made when designing such
      components in an implementation.</t>

      <t>The system architecture is illustrated in <xref 
			target="fig_arch"/> and its main components are described in the
			following subsections.</t>

      <figure align="center" anchor="fig_arch"
              title="Service Disruption Detection System Architecture">
        <artwork align="center"><![CDATA[
    (1)-------+                     (11)----------------+
    | Service |                     |     Alarm and     |
|-- |Inventory|                     | Problem Management|
|   |         |                     |      System       |
|   +---------+                     +-------------------+
|     |                                      ^     Stream
|     |                                      |
|     |       (12)------+           +-------------------+
|     |       | Post-   | Stream    |   Message Broker  |
|     |       | mortem  | <-------- |  with Analytical  |
|     |       | System  |           |    Network Data   |
|     |       +---------+           +-------------------+
|     |            |                         ^     Stream
|     |            |                         |
|     | (8)        | (3)            +-------------------+ Store
|     | Profile    | Fine           | Alarm Aggregation | Label
|     | and        | Tune           | for Anomaly       | --------|
|     | Generate   | SDD            | Detection         |         |
|     | SDD Config | Config         +-------------------+         |
|     |            |                       ^  ^  ^ Stream         |
|     v            v                       |  |  |       Replay   v
|  (2)-----------------+ (9)        (6)-----------------+    (10)------+
|  | Service Disruption| Schedule   | Service Disruption|    |  Data   |
|  |     Detection     | ---------> |     Detection     |<---| Storage |
|  |   Configuration   | Strategy   |                   |    |         |
|  +-------------------+            +-------------------+    +---------+
|                                      ^ ^ Stream ^ ^ ^           ^
|                                      | |        | | |           |
|                                   (7)-------(5)-------+         |
|                                   | Network |  Data   | Store   |
|---------------------------------> |  Model  |  Aggr.  | --------|
                                    |         | Process | Operational 
                                    +---------+---------+ Data
                                           ^  ^  ^ Stream
                                           |  |  |
                                    +-------------------+
                                    |   Message Broker  |
                                    |  with Operational |
                                    |    Network Data   |
                                    +-------------------+
                                           ^  ^  ^ Stream
Subscribe                   Publish        |  |  |
      +-------------------+         (4)-----------------+
      | Network Node with | ------> | Network Telemetry |
----> | Network Telemetry | ------> |  Data Collection  |
      |   Subscription    | ------> |                   |
      +-------------------+         +-------------------+
      ]]></artwork>
      </figure>

      <section anchor="Arch_Inventory" title="Service Inventory">
        <t>A service inventory, (1) in <xref target="fig_arch"/>, is
				used to obtain a list of the connectivity services for which
				Anomaly Detection is to be performed. A service profiling
				process may be executed on the operational network data of the
				service in order to define a configuration of the service
				disruption detection approach and parameters to be used.</t>
      </section>

      <section anchor="Arch_Configuration"
			         title="Service Disruption Detection Configuration">
        <t>Based on this service list and potential preliminary service
        profiling, a configuration of the Service Disruption Detection,
				(2) in <xref target="fig_arch"/>, is produced. It defines the
				set of approaches that need to be applied to perform SDD, as
				well as parameters, grouped in templates, that are to be set
				when executing the algorithms performing SDD per se.</t>

        <t>As the service lives on, the configuration may be adapted,
        (3) in <xref target="fig_arch"/>, as a result of an evolution
        of the profiling being performed. Postmortem analyses are
        produced as a result of Events impacting the service, or of the
        occurrence of false positives raised by the Alarm system. These
        postmortem analyses can improve the parameters of deployed
        profiles and lead to the creation of new service profiles. See
        <xref target="Data_Profiling"/> for details on profiling.</t>
      </section>

      <section anchor="Arch_Collection" title="Operational Data
			Collection">
        <t>Collection of network monitoring data, (4) in <xref
				target="fig_arch"/>, involves the management of the
				subscriptions to network telemetry on nodes of the network, and
				the configuration of the collection infrastructure to receive
				the monitoring data produced by the network.</t>

        <t>The monitoring data produced by the collection infrastructure
				is then streamed through a message broker system, for further
        processing.</t>

        <t>Networks tend to produce extremely large amounts of
        monitoring data. To preserve scaling and reduce costs,
        decisions need to be made on the retention duration of such
        data in storage, and at which level of storage they need to be
        kept. A retention time needs to be set on the raw data produced
        by the collection system, in accordance with its utility for
        further use. This aspect is elaborated in further sections.</t>
      </section>

      <section anchor="Arch_Aggregation" title="Operational Data
			Aggregation">
        <t>Aggregation, (5) in <xref target="fig_arch"/>, is the process
				of producing data sets based on collected network monitoring
				data upon which detection of a service disruption can be
				performed by filtering or aggregating.</t>

        <t>Pre-processing of collected network monitoring data is
        usually performed to reduce the input to the Service Disruption
        Detection component, since not all metrics are relevant for
        this use case. Aggregating data prior to analysis thus may help
        to reduce the amount of data to be processed by the SDD
        component and thereby speed up anomaly detection. This can be
        achieved in multiple ways, depending on the architecture of the
        SDD component. As an example, the temporal granularity of a
        metric may be reduced by calculating aggregated statistics for
        a longer time interval, or the cardinality of a dataset may be
        reduced by dropping metrics that are not needed. Aggregating
        input data into a coarser dimension may simplify and speed up
        SDD execution.</t>
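        <t>As a non-normative sketch, reducing the temporal granularity
        of a counter metric can be as simple as summing samples into
        coarser time buckets; the 60-second interval is an illustrative
        choice:</t>

        <figure>
          <artwork><![CDATA[
```python
from collections import defaultdict

def downsample(samples: list[tuple[int, int]],
               interval: int = 60) -> dict[int, int]:
    """Aggregate (timestamp, packet_count) samples into coarser
    buckets by summing the counts that fall into each interval."""
    buckets: dict[int, int] = defaultdict(int)
    for ts, count in samples:
        buckets[ts - ts % interval] += count
    return dict(buckets)

per_second = [(0, 5), (30, 7), (59, 2), (60, 4), (95, 6)]
print(downsample(per_second))   # two 60-second buckets
```
]]></artwork>
        </figure>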

        <t>A retention time needs to be decided for the aggregated
        operational data and should reflect the expected further use.
        For example, the retention time must be set in accordance with
        the replay capability requirement discussed in <xref
        target="Arch_Replaying"/>.</t>
      </section>

      <section anchor="Arch_SDD" title="Service Disruption Detection">
        <t>Service Disruption Detection processes, (6) in <xref
				target="fig_arch"/>, decide whether a service might be
				degraded to the point where network operation needs to be
				alerted of an ongoing Problem within the network.</t>

        <t>Two key aspects need to be considered when designing the SDD
        component. First, the way the data is being processed needs to
				be carefully designed, as networks typically produce extremely
				large amounts of data which may hinder the scalability of the
				architecture. Second, the algorithms used to make a decision to
				alert the operator need to be designed in such a way that the
				operator can trust that a targeted Service Disruption will be
				detected (no false negatives), while not spamming the operator
				with Alarms that do not reflect an actual issue within the
				network (false positives) leading to Alarm fatigue.</t>

        <t>Two approaches are typically followed to present the data to
				the SDD system. Classically, the aggregated data can be stored
				in a database that is polled at regular intervals by the SDD
				component for decision making. Alternatively, a streaming
				approach can be followed so as to process the data while they
				are being consumed from the collection component.</t>

        <t>For SDD per se, two families of algorithms can be decided
        upon. First, knowledge-based detection approaches can be used,
        mimicking the process that human operators follow when looking
        at the data. Second, Machine Learning based approaches can
        detect outliers based on previously trained operational network
        data.</t>

        <section anchor="Knowledge_Based" title="Knowledge Based">
        <t>Knowledge-based detection comprises several types of
        knowledge sources: domain knowledge from network engineers
        (<xref target="Expert_Rules"/>) understanding the mechanics of
        network protocols and their implications, knowledge from
        relationships in the network topology (<xref
        target="Network_Modeling"/>), and knowledge derived from data
        profiling (<xref target="Data_Profiling"/>), where customer and
        human behavioral aspects are taken into context. Finally, <xref
        target="Detection_Strategies"/> applies a combination of that
        knowledge.</t>
				
          <section anchor="Expert_Rules" title="Expert Rules">
            <t>Some input to SDD is made of established knowledge from
						network engineers. This expertise can be used for both the
						Service Disruption Detection Configuration and SDD itself,
						(2) and (6) in <xref target="fig_arch"/> respectively. For
						example, sudden spikes in drop counters from the forwarding
						plane are likely to be attributed to changes in the routing
						topology. Or, drops in the forwarding plane can manifest as
						an increase of flow counts in the forwarding plane, due to
						the implied congestion and re-establishment of application
						transport sessions. These network behaviours are typically
						sourced from the experience of operating a network
						infrastructure by human operators, and can be used by an SDD
						engine to trigger alerts.</t>
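            <t>As a minimal sketch, an expert rule of the kind described
						above could be encoded as follows; the threshold factor and
						the synthetic counter values are illustrative assumptions,
						not recommended values.</t>
            <figure><artwork><![CDATA[
```python
def spike(counter_series, factor=5.0):
    """Expert rule: flag a sudden spike relative to the prior baseline.

    Encodes the human heuristic from the text: a sudden jump in
    forwarding-plane drop counters warrants attention. The 'factor'
    threshold is an illustrative assumption.
    """
    baseline = sum(counter_series[:-1]) / max(len(counter_series) - 1, 1)
    return counter_series[-1] > factor * max(baseline, 1.0)

drops = [3, 2, 4, 3, 60]   # packets dropped per interval (synthetic)

alerts = []
if spike(drops):
    # A drop spike, possibly correlated with a flow-count increase as
    # described above, is surfaced to the SDD engine as an alert input.
    alerts.append("drop-spike")
```
            ]]></artwork></figure>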
          </section>

          <section anchor="Network_Modeling" title="Network Modeling">
            <t>Some input to SDD is made of established knowledge of the
  					network, (7) in <xref target="fig_arch"/>, that is unrelated
  					to the dimensions according to which outlier detection is
  					performed. For example, the knowledge of the network
  					infrastructure may be required to perform some service
  					disruption detection. Such data need to be rendered
						accessible and updatable for use by SDD. They may come from
						inventories, or automated gathering of data from the network
						itself.</t>
          </section>
  
          <section anchor="Data_Profiling" title="Data Profiling">
            <t>Since each customer has a different usage pattern, expert
						rules cannot be crafted specifically for each customer;
						instead, they need to be defined according to
						pre-established service profiles, (8) in <xref
						target="fig_arch"/>. Processing of monitoring data can be
						performed with machine learning methods in order to identify
						patterns, group them into clusters, and associate clusters
						with pre-established service profiles. External knowledge on
						customer types can also help in associating clusters with
						profiles.</t>
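            <t>A minimal sketch of such profiling is shown below; a
						nearest-centroid assignment on a single feature stands in
						for a real clustering method, and the profile names,
						centroid values, and customer data are illustrative
						assumptions.</t>
            <figure><artwork><![CDATA[
```python
# Assign customers to pre-established service profiles by nearest
# centroid on one feature (mean daily traffic volume, in GB).
profiles = {"residential": 5.0, "small-office": 50.0, "datacenter": 500.0}

def assign_profile(mean_volume_gb):
    # Pick the profile whose centroid is closest to the observed volume.
    return min(profiles, key=lambda p: abs(profiles[p] - mean_volume_gb))

customers = {"cust-a": 7.2, "cust-b": 480.0, "cust-c": 42.0}
clusters = {name: assign_profile(v) for name, v in customers.items()}
```
            ]]></artwork></figure>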
          </section>
  
          <section anchor="Detection_Strategies"
					title="Detection Strategies">
            <t>For a profile, a set of strategies is defined. Each
  					strategy captures one approach to looking at the data (as a
						human operator does) to observe whether an abnormal
						situation is arising. Strategies can use either expert
						rule-based algorithms, as described in <xref
						target="Expert_Rules"/>, or outlier detection algorithms,
						as explained in <xref target="Outlier_Detection"/>. Thus, a
						strategy is defined as a combination of expert rule-based
						algorithms and outlier detection algorithms that together
						trigger an Alarm when a disruption occurs.</t>
  
            <t>When one of the strategies applied for a profile results
						in a concern score above a given threshold, an Alarm MUST be
						raised.</t>

            <t>Depending on the implementation of the architecture, a
  					scheduler may be needed in order to orchestrate the
						evaluation of the Alarm levels for each strategy applied for
						a profile, for all service instances associated with such
						profile, as illustrated in (9) from <xref
						target="fig_arch"/>.</t>
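            <t>The evaluation above could be sketched as follows; the
						strategy functions, the concern-score scaling, and the
						threshold value are illustrative assumptions rather than
						recommended settings.</t>
            <figure><artwork><![CDATA[
```python
# Each strategy returns a concern score in [0, 1]; an Alarm is raised
# when any score for the profile exceeds the threshold.
THRESHOLD = 0.8

def rule_based_strategy(data):
    # Expert rule: absolute drop counter limit (illustrative).
    return 1.0 if data["drops"] > 50 else 0.0

def outlier_strategy(data):
    # Outlier detection: deviation from the historical mean, scaled
    # into [0, 1] (illustrative scaling).
    mean = sum(data["history"]) / len(data["history"])
    return min(abs(data["drops"] - mean) / (10 * mean), 1.0)

def evaluate(service_data, strategies, threshold=THRESHOLD):
    scores = {s.__name__: s(service_data) for s in strategies}
    return any(score > threshold for score in scores.values()), scores

data = {"drops": 60, "history": [3, 2, 4, 3]}
alarm, scores = evaluate(data, [rule_based_strategy, outlier_strategy])
```
            ]]></artwork></figure>
            <t>A scheduler would invoke evaluate() for each strategy of
						each profile, across all service instances associated with
						that profile.</t>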
          </section>
        </section>

        <section anchor="Storage" title="Storage">
          <t>Storage, (10) in <xref target="fig_arch"/>, may be required
					to execute SDD, as some algorithms may be relying on
					historical (aggregated) monitoring data in order to detect
					anomalies. The cardinality, granularity, and retention time of
					historical data should be carefully considered to avoid slow 
					and costly retrieval of this information if required for SDD
					analysis.</t>
        </section>
      </section>

      <section anchor="Arch_Alarm" title="Alarm">
        <t>When the SDD component decides that a service is undergoing a
        disruption, an aggregated relevant-state change notification,
				taking the output of multiple Service Disruption Detection
				processes into account, MUST be sent to the Alarm and
				Problem management system as shown in Figure 4 in <xref
				section="3" sectionFormat="of"
				target="I-D.ietf-nmop-terminology"/> and (11) in <xref
				target="fig_arch"/>. Multiple practical aspects need to be taken
				into account in this component.</t>

        <t>When the issue lasts longer than the interval at which the
				SDD component runs, the relevant-state change mechanism should
				not create multiple notifications to the operator, so as to not
				overwhelm the management of the issue. However, the information
				provided along with the Alarm should be kept up to date during
				the full duration of the
        issue.</t>
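        <t>A minimal sketch of this suppression behavior is shown
				below; the Alarm structure, function names, and service
				identifier are illustrative assumptions.</t>
        <figure><artwork><![CDATA[
```python
# While a disruption persists across SDD runs, only one notification is
# emitted, but the Alarm's attached information is kept up to date.
notifications = []
active_alarms = {}

def report(service, disrupted, info):
    if disrupted:
        if service not in active_alarms:
            active_alarms[service] = {"info": info}
            notifications.append(("raise", service))  # notify once
        else:
            active_alarms[service]["info"] = info     # update in place
    elif service in active_alarms:
        del active_alarms[service]
        notifications.append(("clear", service))

# Three consecutive SDD runs observe the same ongoing issue.
for run_info in ("high drops", "drops persisting", "drops persisting"):
    report("vpn-42", True, run_info)
report("vpn-42", False, None)  # issue resolved
```
        ]]></artwork></figure>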
      </section>

      <section anchor="Arch_Postmortem" title="Postmortem">
        <figure anchor="simplified_lifecycle"
                title="Anomaly Detection Refinement Lifecycle">
          <artwork align="center"><![CDATA[


   Network Anomaly
     Detection             Symptoms
+-------------------+         &
|   +-----------+   | Network Anomalies
|   | Detection |---|---------+
|   |   Stage   |   |         |
|   +-----------+   |         v
+---------^---------+    +-------------------+   Labels  +------------+
          |              | Anomaly Detection |---------->| Validation |
          |              |   Label Store     |<----------|   Stage    |
          |              +-------------------+  Revised  +------------+
   +------------+             |                 Labels
   | Refinement |             |
   |   Stage    |<------------+
   +------------+    Historical Symptoms
                              &
                      Network Anomalies


          ]]></artwork>
        </figure>

        <t>Validation and refinement are performed during Postmortem
				analysis, (12) in <xref	target="fig_arch"/>.</t>

        <t>From an Anomaly Detection Lifecycle point of view, as
				described in <xref
				target="I-D.ietf-nmop-network-anomaly-lifecycle"/>, the
				Service Disruption Detection Configuration evolves over time,
        iteratively, looping over three main phases: detection,
				validation and refinement.</t>

        <t>The Detection phase produces the Alarms that are sent to the
				Alarm and Problem Management System and at the same time it
				stores the network anomaly and Symptom labels into the Label
				Store. This enables network engineers to review the labels to
				validate and edit them as needed.</t>

        <t>The Validation stage is typically performed by network
				engineers reviewing the results of the detection and indicating
				which Symptoms and network anomalies have been useful for the
				identification of Problems in the network. The original labels
				from the Service Disruption Detection are analyzed and an
				updated set of more accurate labels is provided back to the
				label store for version-control.</t>

        <t>The resulting labels are then provided back into the
				Network Anomaly Detection via its refinement capabilities: the
				refinement updates the Service Disruption Detection
				configuration in order to improve the results of the detection
				(e.g., false positives, false negatives, accuracy of the
				boundaries, etc.).</t>
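        <t>The label lifecycle described above could be sketched as
				follows; the store layout, function names, and label fields are
				illustrative assumptions.</t>
        <figure><artwork><![CDATA[
```python
# Detection stores labels; validation submits revised labels; the
# refinement stage reads the latest revision. Versions are kept for
# review, supporting the version-control described in the text.
label_store = {}

def store_label(anomaly_id, labels):
    label_store.setdefault(anomaly_id, []).append(labels)

def latest(anomaly_id):
    return label_store[anomaly_id][-1]

store_label("a1", {"symptom": "drop-spike", "valid": None})  # detection
store_label("a1", {"symptom": "drop-spike", "valid": True})  # validation
```
        ]]></artwork></figure>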
      </section>

      <section anchor="Arch_Replaying" title="Replaying">
        <t>When a service disruption has been detected, it is essential
				for the human operator to be able to analyze the data which led
				to the raising of an Alarm. It is thus important that an SDDS
				preserves both the data which led to the creation of the Alarm
				and human-understandable information on why the data led to
				the raising of an Alarm.</t>

        <t>In early stages of operations, or when experimenting with an
				SDDS, it is common that the parameters used for SDD need to be
				fine-tuned. This process is facilitated by designing the SDDS
				architecture in a way that allows rerunning the SDD algorithms
				on the same input.</t>

        <t>Data retention, as well as its level of detail, needs to be
				defined so as not to sacrifice the ability to replay SDD
				execution for the sake of improving its accuracy.</t>
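        <t>The replay property amounts to SDD being a deterministic
				function of its retained input, as sketched below; the detector
				and threshold values are illustrative assumptions.</t>
        <figure><artwork><![CDATA[
```python
# Because the input that led to the Alarm is retained, rerunning SDD
# with the same or tuned parameters reproduces or revises the decision.
retained_input = [3, 2, 4, 3, 60]   # data that led to the Alarm

def sdd(series, threshold):
    # Illustrative detector: alarm when any sample exceeds the threshold.
    return max(series) > threshold

original_decision = sdd(retained_input, threshold=50)  # Alarm raised
replayed_decision = sdd(retained_input, threshold=50)  # same input
tuned_decision = sdd(retained_input, threshold=70)     # after tuning
```
        ]]></artwork></figure>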
      </section>
    </section>

    <section anchor="Implementation" title="Implementation Status">
      <t>Note to the RFC-Editor: Please remove this section before
      publishing.</t>

      <t>This section records the status of known implementations.</t>

      <section anchor="Cosmos_Bright_Lights" title="Cosmos Bright
			Lights">
        <t>This architecture has been developed as part of a proof of
				concept started in September 2022, first in a dedicated network
				lab environment and later, in December 2022, in the Swisscom
				production network to monitor a limited set of 16 L3 VPN
				connectivity services.</t>

        <t>At the Applied Networking Research Workshop at IETF 117, the
        architecture was first published in the following academic
				paper: <xref target="Ahf23"/>.</t>

        <t>Since December 2022, 20 connectivity service disruptions have
				been monitored, and 52 false positives occurred, due to the
				time series database temporarily not being real-time and to
				missing traffic profiling (the comparison to the previous week
				was not applicable). Out of the 20 connectivity service
				disruptions, 6 parameters were monitored; the service
				disruption was recognized by 1 parameter in 3 cases, by 2
				parameters in 8 cases, by 3 parameters in 6 cases, and by 4
				parameters in 2 cases.</t>

        <t>A real-time streaming based version has been deployed in
				Swisscom production as a proof of concept in June 2024,
				monitoring more than 13,000 L3 VPNs concurrently. Improved
				profiling capabilities are currently under development.</t>
      </section>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>The security of the retained data needs to be considered.
			Compromised data could reveal sensitive information, could
			prevent valid Alarms from being raised, or could cause false
			Alarms.</t>
    </section>

    <section anchor="Contributors" title="Contributors">
      <t>The authors would like to thank Alex Huang Feng, Ahmed
			Elhassany and Vincenzo Riccobene for their valuable contribution.
			</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>The authors would like to thank Qin Wu, Ignacio Dominguez
	    Martinez-Casanueva, Adrian Farrel, Reshad Rahman, Ruediger Geib,
		  Paul Aitken and Yannick Buchs for their review and valuable
			comments.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.2119.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.8174.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.8969.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.9232.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.ietf-nmop-terminology.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.ietf-nmop-network-anomaly-semantics.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.ietf-nmop-network-anomaly-lifecycle.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.ietf-nmop-simap-concept.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.havel-nmop-digital-map.xml'?>

      <?rfc include='https://bib.ietf.org/public/rfc/bibxml-ids/reference.I-D.mackey-nmop-kg-for-netops.xml'?>
    </references>

    <references title="Informative References">
      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.4364.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.5102.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.7011.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.7270.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.7854.xml'?>

      <?rfc include='https://xml.resource.org/public/rfc/bibxml/reference.RFC.8343.xml'?>

      <reference anchor="Ahf23" target="https://hal.science/hal-04307611">
        <front>
          <title>Daisy: Practical Anomaly Detection in large BGP/MPLS and
          BGP/SRv6 VPN Networks</title>

          <author fullname="Alex Huang Feng" initials="A."
                  surname="Huang Feng"/>

          <date month="July" year="2023"/>
        </front>

        <seriesInfo name="DOI" value="10.1145/3606464.3606470"/>

        <refcontent>IETF 117, Applied Networking Research
        Workshop</refcontent>
      </reference>

      <reference anchor="VAP09"
                 target="https://www.researchgate.net/publication/220565847_Anomaly_Detection_A_Survey">
        <front>
          <title>Anomaly detection: A survey</title>

          <author fullname="Varun Chandola" initials="V." surname="Chandola"/>

          <author fullname="Arindam Banerjee" initials="A." surname="Banerjee"/>

          <author fullname="Vipin Kumar" initials="V." surname="Kumar"/>

          <date month="July" year="2009"/>
        </front>

        <seriesInfo name="DOI" value="10.1145/1541880.1541882"/>

        <refcontent>ACM Computing Surveys 41</refcontent>
      </reference>

      <reference anchor="ASNL25"
                 target="https://hal.science/hal-05055886">
        <front>
          <title>Knowledge-based anomaly detection: Survey, challenges, and future directions</title>

          <author fullname="Abdul Qadir Khan" initials="A." surname="Qadir Khan"/>

          <author fullname="Saad El Jaouhari" initials="S." surname="El Jaouhari"/>

          <author fullname="Nouredine Tamani" initials="N." surname="Tamani"/>

          <author fullname="Lina Mroueh" initials="L." surname="Mroueh"/>

          <date month="May" year="2025"/>
        </front>

        <seriesInfo name="DOI" value="10.1016/j.engappai.2024.108996"/>
      </reference>

      <reference anchor="Deh22"
                 target="https://www.oreilly.com/library/view/data-mesh/9781492092384/">
        <front>
          <title>Data Mesh</title>

          <author fullname="Zhamak Dehghani" initials="Z." surname="Dehghani"/>

          <date month="March" year="2022"/>
        </front>

        <seriesInfo name="ISBN" value="9781492092391"/>

        <refcontent>O'Reilly Media</refcontent>
      </reference>

      <reference anchor="W3C-RDF-concept-triples"
                 target="https://www.w3.org/TR/rdf-concepts/#section-triples">
        <front>
          <title>W3C RDF concept semantic triples</title>

          <author fullname="Richard Cyganiak" initials="R." surname="Cyganiak"/>
          <author fullname="David Wood" initials="D." surname="Wood"/>
		  <author fullname="Markus Lanthaler" initials="M." surname="Lanthaler"/>

          <date month="February" year="2014"/>
        </front>
        <refcontent>W3 Consortium</refcontent>
      </reference>
    </references>
  </back>
</rfc>
