To address the challenges of Zigbee network security, particularly key management and resilience against cyber threats such as DDoS attacks, this section introduces a reinforcement learning (RL) model for adaptive key rotation.
Encryption and key management in Zigbee
Zigbee employs the Advanced Encryption Standard with 128-bit keys (AES-128) for encryption, where the encryption function can be represented as
$$\begin{aligned} E_k(x)=\oplus _{i=1}^{10} \left( k_i, x\right) \end{aligned}$$
(1)
where \(E_k(x)\) is the encrypted output, x is the plaintext input, \(k_i\) represents the key used in the \(i\text{th}\) round, and \(\oplus \) denotes the XOR operation. Key rotation is also essential for maintaining security. Periodic rotation can be modeled as
$$\begin{aligned} t_{\textit{rotation}}=t_{\textit{initial}}+n \times \Delta t \end{aligned}$$
(2)
where \(t_{\textit{rotation}}\) is the time of the next key rotation, \(t_{\textit{initial}}\) is the time of the initial key establishment, n is the number of completed rotations, and \(\Delta t\) is the fixed rotation interval.
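The fixed schedule of Eq. (2) can be sketched in a few lines of Python. This is only an illustration of the formula: the function names, the use of seconds as the time unit, and the hourly interval are assumptions, not part of the Zigbee specification.

```python
def next_rotation_time(t_initial: float, n: int, delta_t: float) -> float:
    """Eq. (2): rotation time after n completed rotations."""
    if n < 0 or delta_t <= 0:
        raise ValueError("n must be >= 0 and delta_t must be > 0")
    return t_initial + n * delta_t


def completed_intervals(t_initial: float, now: float, delta_t: float) -> int:
    """Whole rotation intervals elapsed since the initial key establishment."""
    return max(0, int((now - t_initial) // delta_t))


# Illustrative values: key established at t = 0 s, one rotation per hour.
print(next_rotation_time(0.0, 3, 3600.0))  # 10800.0
```

Because the interval is fixed, such a schedule cannot react to changing threat levels, which is the limitation the adaptive approach is meant to address.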
Reinforcement learning basics
RL is a machine learning method in which an agent learns by interacting with its environment. The agent learns a policy that maximizes the cumulative reward, which reflects the effect of its actions on the environment. Q-learning is a model-free RL algorithm. Each possible action in Q-learning has a corresponding Q-value, which estimates how beneficial it is to take that action in a specific state. The Q-learning update rule is given by
$$\begin{aligned} Q(s, a) \leftarrow Q(s, a)+\alpha \left[ r+\gamma \max _{a^\prime } Q\left( s^\prime , a^\prime \right) -Q(s, a)\right] \end{aligned}$$
(3)
Here, Q(s, a) is the current Q-value for state s and action a. The update is based on the immediate reward r and the discounted maximum Q-value of the next state \(s^\prime \) over all possible actions \(a^\prime \); \(\gamma \) is the discount factor (which balances immediate and future reward), and \(\alpha \) is the learning rate (which determines to what extent newly acquired information overrides old information). The policy \(\pi \) at any state s can be derived from the Q-table as
$$\begin{aligned} \pi (s)=\arg \max _a Q(s, a) \end{aligned}$$
(4)
where the policy at any state s is the action a that has the highest Q-value in state s. Under the policy \(\pi \), the value function V can be calculated by
$$\begin{aligned} V^\pi (s)=\mathbb {E}\left[ \sum _{t=0}^\infty \gamma ^t r_t \mid s_0=s\right] \end{aligned}$$
(5)
The value function represents the expected cumulative reward obtained by starting from state s and following policy \(\pi \), where \(r_t\) is the reward at time t. The optimal value function satisfies the Bellman optimality equation
$$\begin{aligned} V^*(s)=\max _a \sum _{s^\prime } P\left( s^\prime \mid s, a\right) \left[ r\left( s, a, s^\prime \right) +\gamma V^*\left( s^\prime \right) \right] \end{aligned}$$
(6)
The Bellman optimality equation provides the basis for finding the optimal value function \(V^*(s)\). It states that the value of a state s under an optimal policy is the maximum expected return achievable, taking into account the immediate reward r, the probability of transitioning to state \(s^\prime \) from state s taking action a, and the discounted value of the future state \(s^\prime \).
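The update rule of Eq. (3) and the greedy policy of Eq. (4) can be illustrated with a minimal tabular sketch. The state labels, action labels, and the values of \(\alpha \) and \(\gamma \) below are placeholders chosen for the example, not values used in this work.

```python
from collections import defaultdict

ACTIONS = ("a1", "a2")  # placeholder action labels


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update (Eq. 3) in place."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    td_error = r + gamma * best_next - Q[(s, a)]  # temporal-difference error
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]


def greedy_policy(Q, s):
    """Eq. (4): return the action with the highest Q-value in state s."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
q_update(Q, "s0", "a1", r=1.0, s_next="s1")
print(Q[("s0", "a1")])         # 0.1
print(greedy_policy(Q, "s0"))  # a1
```

Starting from zero, a single reward of 1.0 moves the entry by \(\alpha \) times the temporal-difference error, so the greedy policy already prefers the rewarded action.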
Reinforcement learning in Zigbee key rotation
To apply RL to Zigbee key rotation, the parameters are defined as follows. The state space is \(S=\left\{ s_1, s_2, s_3, s_4\right\} \), where \(s_1\) denotes the time elapsed since the last key rotation, \(s_2\) is the number of detected unauthorized access attempts, \(s_3\) represents the network traffic volume, and \(s_4\) is the historical data on key rotation effectiveness. The action space A consists of two primary actions: rotate key \(\left( a_1\right) \) and maintain current key \(\left( a_2\right) \). The policy \(\pi \) is a function that maps states to actions. Using a softmax selection rule, the policy for state s can be expressed as
$$\begin{aligned} \pi (a \mid s)=\frac{e^{Q(s, a) / \tau }}{\sum _{a^\prime \in A} e^{Q\left( s, a^\prime \right) / \tau }} \end{aligned}$$
(7)
where \(\tau \) is the temperature parameter controlling the exploration-exploitation balance. A high temperature \(\tau \) promotes exploration by making the probability distribution over actions more uniform, encouraging the agent to try different actions and gather more information about the environment. A low temperature \(\tau \) favors exploitation by concentrating the probability distribution on actions with higher estimated rewards, encouraging the agent to choose actions that have previously yielded high rewards.
In our method, \(\tau \) is dynamically adjusted to balance exploration and exploitation. Initially, a higher \(\tau \) is used to promote exploration. As the agent learns and gathers more information, \(\tau \) is gradually decreased according to an annealing schedule. The annealing schedule can be linear, exponential, or based on other decay functions. We used an exponential decay schedule, i.e. \(\tau =\tau _0 \times \exp (-\lambda \times t) \), where \(\tau _0\) is the initial temperature, \(\lambda \) is the decay rate, and t is the time step. In addition, during training, \(\tau \) is periodically adjusted based on the agent’s performance. If the agent is not exploring enough (indicated by low variance in action selection), \(\tau \) is temporarily increased to encourage more exploration.
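The softmax rule of Eq. (7) and an exponentially annealed temperature can be sketched as follows. The sketch assumes a positive decay rate, so that \(\tau \) shrinks over time as described; the lower bound `tau_min` is an added safeguard against division by a vanishing temperature and is not part of the method as stated.

```python
import math


def softmax_policy(q_values, tau):
    """Eq. (7): Boltzmann action probabilities for a single state."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    m = max(q_values)  # subtracting the max avoids overflow in exp()
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def annealed_tau(tau0, decay, t, tau_min=0.01):
    """Exponential annealing tau0 * exp(-decay * t), floored at tau_min."""
    return max(tau_min, tau0 * math.exp(-decay * t))


q = [1.0, 0.5]  # placeholder Q-values for (rotate, maintain)
for step in (0, 50, 200):
    tau = annealed_tau(tau0=5.0, decay=0.02, t=step)
    print(step, round(tau, 3), [round(p, 3) for p in softmax_policy(q, tau)])
```

Early in training the two probabilities are close to uniform; as \(\tau \) decays, the probability mass moves toward the action with the larger Q-value.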
The reward function R(s, a) is designed to capture the immediate and long-term consequences of actions. It includes components for security, performance, and operational costs:
$$\begin{aligned} R(s, a)=w_1 \times \textit{Security}(s, a)+ w_2 \times \textit{Performance}(s, a)-w_3 \times \textit{Cost}(s, a) \end{aligned}$$
(8)
where \(w_1\), \(w_2\), and \(w_3\) are weights indicating the importance of each component. A higher weight \(w_1\) is assigned to the security component to emphasize its importance in maintaining network integrity and protecting against attacks. A moderate weight \(w_2\) is assigned to the performance component to ensure that the key rotation method does not degrade overall network efficiency. A lower weight \(w_3\) is assigned to the cost component to ensure cost efficiency while not compromising security and performance.
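A minimal sketch of this reward, assuming the cost term is subtracted and that each component has been normalized to [0, 1]. The default weights are hypothetical; they only respect the ordering \(w_1> w_2 > w_3\) described above and are not the values used in the experiments.

```python
def reward(security, performance, cost, weights=(0.6, 0.3, 0.1)):
    """Eq. (8): weighted security and performance gains minus weighted cost.

    All three component scores are assumed to lie in [0, 1].
    """
    w1, w2, w3 = weights
    return w1 * security + w2 * performance - w3 * cost


# Hypothetical component scores while unauthorized access attempts are seen:
rotate = reward(security=0.9, performance=0.6, cost=0.8)
maintain = reward(security=0.3, performance=0.8, cost=0.1)
print(rotate > maintain)  # True
```

With these illustrative scores, rotating the key earns the higher reward despite its cost, because the security term dominates.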
Our objective is to find an optimal policy \(\pi ^*\) that maximizes the expected cumulative reward. The optimization problem can be formulated as
$$\begin{aligned} \pi ^*=\arg \max _\pi \mathbb {E}\left[ \sum _{t=0}^\infty \gamma ^t R\left( s_t, \pi \left( s_t\right) \right) \right] \end{aligned}$$
(9)
which is subject to operational constraints such as latency and resource usage.
Reward function and policy optimization
For Zigbee key rotation, the Q-learning update is given by
$$\begin{aligned} Q(s, a) \leftarrow Q(s, a)+\alpha _t \cdot \left( R(s, a)+\gamma \cdot \max _{a^\prime \in A} Q\left( s^\prime , a^\prime \right) -Q(s, a)\right) \end{aligned}$$
(10)
where \(\alpha _t\) is the time-dependent learning rate, which can be modeled as
$$\begin{aligned} \alpha _t=\frac{\alpha _0}{1+\textit{decayrate}\cdot t} \end{aligned}$$
(11)
where \(\alpha _0\) is the initial learning rate and decayrate determines the rate of reduction over time.
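The decaying learning rate of Eq. (11) and its use in the update of Eq. (10) can be sketched as follows. The parameter values are placeholders, and the update is shown for a single Q-table entry to keep the example short.

```python
def learning_rate(alpha0, decay_rate, t):
    """Eq. (11): alpha_t = alpha0 / (1 + decay_rate * t)."""
    return alpha0 / (1.0 + decay_rate * t)


def q_update_decayed(q_sa, r, max_q_next, t,
                     alpha0=0.5, decay_rate=0.1, gamma=0.9):
    """Eq. (10) for one (s, a) entry, using the learning rate at step t."""
    alpha_t = learning_rate(alpha0, decay_rate, t)
    return q_sa + alpha_t * (r + gamma * max_q_next - q_sa)


print(learning_rate(0.5, 0.1, 0))   # 0.5
print(learning_rate(0.5, 0.1, 10))  # 0.25
print(q_update_decayed(0.0, r=1.0, max_q_next=0.0, t=0))  # 0.5
```

Large early steps let the Q-table adapt quickly, while smaller later steps damp the effect of noisy rewards.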
A more detailed model of the Security component can be represented as a weighted sum of security metrics
$$\begin{aligned} \textit{Security}(s, a)=\sum _i w_i \cdot m_i(s, a) \end{aligned}$$
(12)
where \(m_i(s, a)\) represents different security metrics and \(w_i\) denotes their respective weights.
Evaluation criteria
Key performance indicators (KPIs) are monitored via the built-in analytics tools of Network Simulator 3 (NS-3), which provide real-time data on network performance. The baseline for the traditional key rotation method is shown in Table 3.
We use hypothesis testing and confidence interval analysis to assess the significance of the observed differences, and quantify improvements in simulations with real data using the following comparison:
$$\begin{aligned} \triangle KPI=KPI_{RL}-KPI_{\textit{Traditional}} \end{aligned}$$
(13)
where \(\triangle K P I\) represents the difference in performance metrics between the RL model and traditional method.
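As a rough illustration, Eq. (13) can be computed on per-run KPI samples together with a confidence interval for the difference. The sketch uses a normal approximation from Python's standard library, which is an assumption made for brevity; with few simulation runs, a t-based interval would be more appropriate. The sample values are placeholders, not measurements from this study.

```python
import math
from statistics import NormalDist, mean, stdev


def delta_kpi(rl_runs, traditional_runs, confidence=0.95):
    """Eq. (13) on run means, with a normal-approximation confidence interval."""
    diff = mean(rl_runs) - mean(traditional_runs)
    std_err = math.sqrt(stdev(rl_runs) ** 2 / len(rl_runs)
                        + stdev(traditional_runs) ** 2 / len(traditional_runs))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return diff, (diff - z * std_err, diff + z * std_err)


# Placeholder per-run KPI values for illustration only.
rl = [3.0, 4.0, 5.0, 4.0]
traditional = [1.0, 2.0, 3.0, 2.0]
diff, (low, high) = delta_kpi(rl, traditional)
print(diff)  # 2.0
print(low > 0)  # True: the interval excludes zero for these placeholder values
```

An interval that excludes zero suggests that the difference is unlikely to be due to run-to-run variation alone.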
In order to evaluate the effectiveness, performance and cost of the proposed method, we use the following metrics for measurement.

1. Security metrics
Security metrics evaluate the effectiveness of the proposed method in protecting the network against various threats. These metrics typically include

Detection Rate: The percentage of successful detections of attempted attacks.

False Positive Rate: The percentage of benign activities incorrectly identified as attacks.

Attack Mitigation Efficiency: The effectiveness of the method in mitigating the impact of detected attacks.


2. Performance metrics
Performance metrics assess the impact of the proposed method on the overall network performance. These metrics include

Latency: The time delay introduced by the security measures.

Throughput: The rate at which data is successfully transmitted through the network.

Packet Loss: The percentage of data packets lost due to security interventions.


3. Cost metrics
Cost metrics evaluate the computational and operational expenses associated with implementing the proposed method. These metrics include

Computational overhead: The additional processing power required to execute the security measures.

Energy consumption: The amount of energy consumed by the security operations, particularly relevant in resource-constrained environments.

Implementation complexity: The effort and resources needed to deploy and maintain the security measures.
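Several of the metrics above reduce to simple ratios over event counts. The helpers below are a sketch of how they might be computed from simulation logs; the function names and the count-based inputs are assumptions, and metrics such as attack mitigation efficiency or implementation complexity are omitted because their exact definitions are not given here.

```python
def detection_rate(true_positives, false_negatives):
    """Percentage of attempted attacks that were detected."""
    attacks = true_positives + false_negatives
    return 100.0 * true_positives / attacks if attacks else 0.0


def false_positive_rate(false_positives, true_negatives):
    """Percentage of benign activities flagged as attacks."""
    benign = false_positives + true_negatives
    return 100.0 * false_positives / benign if benign else 0.0


def packet_loss(sent, received):
    """Percentage of transmitted packets that were not received."""
    return 100.0 * (sent - received) / sent if sent else 0.0


# Illustrative counts only.
print(detection_rate(9, 1))        # 90.0
print(false_positive_rate(1, 99))  # 1.0
print(packet_loss(200, 190))       # 5.0
```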

Q-learning algorithm design
First, all Q-values are initialized to 0, representing an unbiased starting point, which can be expressed as
$$\begin{aligned} Q(s, a)=0, \forall s \in S, a \in A \end{aligned}$$
(14)
Then, the \(\epsilon \)-greedy policy is used for action selection. The \(\epsilon \)-greedy policy is a simple yet effective way to balance exploration and exploitation. At each step, with probability \(\epsilon \), a random action is chosen (exploration), and with probability \(1-\epsilon \), the action with the highest Q-value is chosen (exploitation). The value of \(\epsilon \) is adjusted over time, starting from a higher value (encouraging exploration) and gradually decreasing to favor exploitation.
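The selection rule can be sketched as follows. The multiplicative decay with a lower bound is one common way to reduce \(\epsilon \) over time and is an assumption here, since the exact adaptation rule is not specified; all parameter values are placeholders.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit


def decayed_epsilon(eps_start, eps_min, decay, step):
    """Hypothetical schedule: multiplicative decay, bounded below by eps_min."""
    return max(eps_min, eps_start * decay ** step)


q = [0.2, 0.9]  # placeholder Q-values for (rotate, maintain)
print(epsilon_greedy(q, epsilon=0.0))        # 1
print(decayed_epsilon(1.0, 0.05, 0.99, 0))   # 1.0
```

With \(\epsilon =0\) the rule is purely greedy, so the call above always returns the index of the larger Q-value.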
Finally, the reward function is further enhanced to incorporate additional network parameters and security metrics. The revised function is given by:
$$\begin{aligned} R(s, a)=w_1 \cdot \textit{SecurityMetric}(s, a)+w_2 \cdot \textit{Network Efficiency}(s, a)-w_3 \cdot \textit{ResourceUsage}(s, a) \end{aligned}$$
(15)
where SecurityMetric measures the network’s security posture, Network Efficiency evaluates network performance, and ResourceUsage accounts for the consumption of network resources.