This is the write-up of a recent event we experienced on our network. It is a combination of a journal of symptoms, the troubleshooting steps taken, and a brief overview of the environment and platforms involved. This isn’t a forensic analysis of the cause or of how the behavior varies across environments. Rather, it’s meant as a heads up in case you see something similar in your environment.
Because this is an outage report, I’ll start with the good stuff (an explanation of the details). Some steps during the investigation and details of our environment have been left out. My purpose is to share our experiences, symptoms, and the commands used to identify the problem so that others faced with this situation can more quickly identify and isolate it.
Certain Intel NICs, when the host machine goes to sleep, will send out excessive amounts of IPv6 multicast listener report traffic.
In our case, these were Dell 9020s with Intel I217-LM NICs.
Through a combination of details about our environment, this caused a large-scale outage. Initial troubleshooting did not lead to the cause, as detailed below.
Initial Observations
Our monitoring systems were reporting major outages throughout our network. We saw a flood of messages indicating ports going down and coming back up. Just about all the ports on the Cisco VSS (2×6509), over 250 ports, were flapping. There was no indication of why, just link up/down:
%LINEPROTO-SW1_SP-5-UPDOWN: Line protocol on Interface GigabitEthernet2/3/11, changed state to down
%LINEPROTO-SW1_SP-5-UPDOWN: Line protocol on Interface Port-channel253, changed state to down
%LINK-3-UPDOWN: Interface Port-channel12, changed state to up
%LINK-SW1_SP-3-UPDOWN: Interface Port-channel253, changed state to down
%LINEPROTO-SW1_SP-5-UPDOWN: Line protocol on Interface GigabitEthernet1/1/12, changed state to up
Accessing switches on the other side of these links did not provide any information as to why this was happening.
Going through the basic troubleshooting steps didn’t turn up any useful information. The usual suspects of spanning tree, broadcast storms, and high processor utilization were all missing. The apparent lack of high processor utilization turned out to be platform dependent and may be different on your network.
We thought we had stabilized the network, but were proven wrong. I’ll leave out the details of what we did to get to that point. We opened a TAC case and dove deeper into the box. This is when we found the cause.
First off, my usual step of checking switch health via show proc cpu was misleading. Our VSS is built on the Supervisor 720, which has separate route processor and switch processor components. Our route processor was fine, but the switch processor was pegged at 100%. This was determined by running a remote command on the switch processor:
remote command switch show proc cpu sort
CPU utilization for five seconds: 99%/81%; one minute: 99%; five minutes: 99%
 PID Runtime(ms)   Invoked  uSecs    5Sec    1Min   5Min TTY Process
 103       15684    121518    129 100.00%  67.37% 66.97%   0 Heartbeat Proces
 578     4819716    247249  19493  11.43%   8.85%  8.88%   0 LTL MGR
What this tells us:
- Interrupt usage (81%) is very high – this is bad
- Heartbeat process is 100% – this is bad
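As an aside, the slash notation in that first line packs two numbers together: total five-second CPU and, after the slash, the portion spent at interrupt level (handling punted packets). A small Python helper, my own and not part of the original troubleshooting, to split them apart:

```python
import re

def parse_cpu_line(line):
    """Split the IOS five-second CPU figure 'X%/Y%' into total,
    interrupt-level, and process-level utilization."""
    m = re.search(r"five seconds: (\d+)%/(\d+)%", line)
    if m is None:
        return None
    total, interrupt = int(m.group(1)), int(m.group(2))
    # Process-level CPU is whatever remains after interrupt handling.
    return {"total": total, "interrupt": interrupt, "process": total - interrupt}

line = "CPU utilization for five seconds: 99%/81%; one minute: 99%; five minutes: 99%"
print(parse_cpu_line(line))  # {'total': 99, 'interrupt': 81, 'process': 18}
```

A high interrupt share like 81% usually means the CPU is drowning in punted traffic rather than being consumed by a runaway process.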
The switch was so busy that it was unable to manage its line cards, which explains why the ports were going up and down. Analysis of the syslog after the fact revealed a couple of interesting messages:
%PFREDUN-SW1_SP-7-KPA_WARN: RF KPA messages have not been heard for 27 seconds
%MLSM-6-LC_SCP_FAILURE: NMP encountered internal communication failure for
%ICC-SW1_SP-5-WATERMARK: 1055 pkts for class EARL_L2-DRV are waiting to be processed
What this tells us:
- Keepalive messages haven’t been received. Best I can tell, this indicates that the standby processor hasn’t responded to keepalives (or it had responded, but the active SP couldn’t process it).
- The SP was unable to communicate with/update the CEF tables on the line cards; this caused traffic to be software switched, pouring gasoline on the fire
- Inter-card communication: There are heartbeat packets between the SP and line cards that are queued and waiting to be processed.
To determine what was causing the high SP CPU, we ran debug netdr capture rx, which captures packets destined for the CPU. In our case this was run on the SP because it was the subsystem having the problem. The results can be viewed with show netdr captured-packets. A partial output:
A total of 4096 packets have been captured
The capture buffer wrapped 0 times
Total capture capacity: 4096 packets

------- dump of incoming inband packet -------
interface NULL, routine mistral_process_rx_packet_inlin, timestamp 10:29:55.297
dbus info: src_vlan 0x373(883), src_indx 0x1070(4208), len 0x5A(90)
  bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x5802(22530)
  2E820400 03730000 10700000 5A080000 0C000060 07000004 00000000 5802E3D8
mistral hdr: req_token 0x0(0), src_index 0x1070(4208), rx_offset 0x76(118)
  requeue 0, obl_pkt 0, vlan 0x373(883)
destmac 33.33.00.00.00.01, srcmac C8.1F.66.A8.EA.87, protocol 86DD
protocol ipv6: version 6, flow 1610612736, payload 32, nexthdr 0, hoplt 1
  class 0, src FE80::CA1F:66FF:FEA8:EA87, dst FF02::1

------- dump of incoming inband packet -------
interface NULL, routine mistral_process_rx_packet_inlin, timestamp 10:29:55.297
dbus info: src_vlan 0x373(883), src_indx 0x1070(4208), len 0x5A(90)
  bpdu 0, index_dir 0, flood 0, dont_lrn 0, dest_indx 0x5802(22530)
  36820400 03730000 10700000 5A080000 0C000020 07000004 00000000 58027BC5
mistral hdr: req_token 0x0(0), src_index 0x1070(4208), rx_offset 0x76(118)
  requeue 0, obl_pkt 0, vlan 0x373(883)
destmac 33.33.00.00.00.01, srcmac C8.1F.66.A8.73.29, protocol 86DD
protocol ipv6: version 6, flow 1610612736, payload 32, nexthdr 0, hoplt 1
  class 0, src FE80::CA1F:66FF:FEA8:7329, dst FF02::1
After parsing through the file, we determined a handful of machines were generating an inordinate amount of IPv6 multicast listener report traffic. The key things from this output:
- protocol 86DD – IPv6
- destination IPv6 – FF02::1 (all-nodes multicast)
- srcmac C8.1F.66.A8.73.29 – offending machine
- next-header – 0 (hop-by-hop option)
- hoplt 1 – Hop Limit of 1
Furthermore, it took just 0.092 seconds to collect the 4096 packets. Eight MAC addresses stood out as clearly generating all this traffic. We estimate that the group of machines was generating about 40,000 packets per second, all of which had to be handled in software. Simply too much. The SP couldn’t handle the load and was unable to manage its own line cards, causing several hundred ports to flap.
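The arithmetic behind that estimate, and the MAC tally, can be sketched with a short Python script. This is my own helper, not something from the original investigation; feed it the saved show netdr captured-packets output as text:

```python
import re
from collections import Counter

def top_talkers(netdr_text, npkts=4096, elapsed_s=0.092):
    """Count srcmac occurrences in 'show netdr captured-packets' text
    and estimate the aggregate punt rate from the capture duration."""
    macs = Counter(re.findall(r"srcmac ([0-9A-Fa-f.]+),", netdr_text))
    return macs.most_common(), round(npkts / elapsed_s)

sample = """\
destmac 33.33.00.00.00.01, srcmac C8.1F.66.A8.EA.87, protocol 86DD
destmac 33.33.00.00.00.01, srcmac C8.1F.66.A8.73.29, protocol 86DD
destmac 33.33.00.00.00.01, srcmac C8.1F.66.A8.73.29, protocol 86DD
"""
talkers, pps = top_talkers(sample)
print(talkers)  # [('C8.1F.66.A8.73.29', 2), ('C8.1F.66.A8.EA.87', 1)]
print(pps)      # 44522 -- 4096 packets in 0.092 s
```

With the buffer filling in under a tenth of a second, roughly 44,000 packets per second were hitting the CPU in aggregate, consistent with the ~40,000 pps we attributed to the offending group.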
To quickly stabilize things, we deleted the VLAN hosting these machines. Processor utilization dropped to normal levels.
To summarize
- Direct cause: SP CPU so high that the switch was unable to maintain internal communications between itself and its line cards, causing all ports to flap.
- Contributing causes:
  - Large, flat layer-2 domain
  - Platform architecture with separate route and switch processors, which misled initial troubleshooting
- Root cause: Bad NIC driver from Intel causing machines in certain sleep states to generate inordinate amounts of IPv6 Multicast Listener Report traffic.
- Remediation: Deleted the VLAN hosting these machines, thereby preventing the traffic from reaching the SP. This is neither a scalable nor a permanent solution.
There are some indications that not having an SVI (VLAN interface) for this network (which is routed by a firewall) contributed either to the problem or to its isolation. This remains unclear. Would the route processor have taken the brunt of the load if there had been an SVI? Might this have been easier to troubleshoot in that topology?
It didn’t matter that the VSS didn’t have IPv6 enabled, which is of great concern.
Due to resource constraints, we are unable to do an in-depth analysis of how different platforms, topologies, etc would have acted in a similar situation. Please feel free to comment and share any experiences.
Other organizations have seen this problem and, to varying degrees, have had network disruptions because of it. This article has some additional details about Dell machines seeing this issue, along with a pcap file. Intel is aware of the issue, and there are fixes (at the driver level, at the BIOS level, and by disabling IPv6). Make sure your drivers are up to date, and, as much as I encourage the adoption of IPv6, if you aren’t using it, disable it on your end stations.
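If you end up with a capture of your own, tallying the talkers can be scripted as well as done in Wireshark. Below is a minimal sketch using only the Python standard library; the function names are mine, and it assumes a classic little-endian, microsecond-resolution pcap file (not pcapng):

```python
import struct
from collections import Counter

def ipv6_src_macs(pcap_bytes):
    """Count source MACs of IPv6 (EtherType 0x86DD) frames in a
    classic pcap byte string. No pcapng support."""
    counts = Counter()
    off = 24  # skip the 24-byte pcap global header
    while off + 16 <= len(pcap_bytes):
        # Per-packet header fields: ts_sec, ts_usec, incl_len, orig_len
        incl_len = struct.unpack_from("<I", pcap_bytes, off + 8)[0]
        frame = pcap_bytes[off + 16: off + 16 + incl_len]
        if len(frame) >= 14 and frame[12:14] == b"\x86\xdd":
            counts[frame[6:12].hex(":")] += 1  # bytes 6-11 are the src MAC
        off += 16 + incl_len
    return counts

def make_pcap(frames):
    """Build a tiny synthetic pcap in memory for demonstration."""
    hdr = struct.pack("<IHHiIII", 0xA1B2C3D4, 2, 4, 0, 0, 65535, 1)
    body = b"".join(struct.pack("<IIII", 0, 0, len(f), len(f)) + f for f in frames)
    return hdr + body

# One fake MLD-ish frame: all-nodes dest MAC, offending src, IPv6 EtherType.
frame = (bytes.fromhex("333300000001") + bytes.fromhex("c81f66a87329")
         + b"\x86\xdd" + b"\x00" * 40)
print(ipv6_src_macs(make_pcap([frame, frame])))
```

On a real capture you would read the file with open(path, "rb").read() and sort the resulting counts; the offenders stand out immediately.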

I had the same thing happen last month. Nagios was telling me my lab’s switch stack couldn’t be reached. A quick check showed Nagios was wrong. Pings to the stack from Nagios showed packet loss. The problem was with the switch my Nagios box was connected to. The Cisco 2960’s CPU was spiking and dropping packets. Off to Wireshark, which showed about 6,000 pps on one of the VLANs. All ICMPv6 MLDs. I got the MACs and forwarded them to IT.
They found the two PCs, both new Dells that were hibernating. IT disabled IPv6 to fix the problem. I didn’t research the issue since I’m not IT, I filed it in the “not my problem” folder I think….
I’ll forward this on to them so they can patch the NICs/PCs.
Thanks for a great post!
-Ryan
Thanks for the feedback, Ryan. This was definitely one of those situations where I thought ‘I should share this so someone else doesn’t go through the pain I did’ 🙂
Would a “(config-if)# storm-control multicast level” type throttling command on the access-layer switches help with this issue at all?
Good question. I know in talking with TAC, there was concern that mitigating this type of traffic might be difficult. Thanks for the suggestion.
We’ve also seen the same thing since late last year with Dell machines in sleep mode. I believe there are BIOS updates to fix it properly.
It has also been discussed on cisco-nsp a couple of times. Search for “C3k: IPv6 multicast listener reports causes high CPU” or “AMT/vPro MLD storms”.
In our case our 6500+Sup720 would just sit with CPU at 100% solidly until the offending machines could be tracked down, disconnected, then upgraded. Having an SVI in the VLAN with the traffic did not help.
To prevent this and other similar issues seen over the last couple of years we are now filtering all IPv6 traffic at distribution. It was not possible to do that with our h’w/s’w combination at the 6500.
PS We don’t have our heads completely in the sand – our new network being rolled out at the moment has v6 support. That probably wouldn’t have stopped this from having an impact, but it might have made it easier to mitigate at least.
Nice write up. I too ran into this exact same issue with IPv6. This led me to police this traffic type heading to the control plane on the Juniper MX routers.
Certain NICs also bring this same behavior in the form of ARP flooding too while sleeping. Those too can be policed via non-shared policers on MX routers.
I too had this with a recent rollout of nearly a thousand new HP workstations across multiple campus subnets. On my campus subnets without v6 support it was catastrophic. What I saw was that an individual machine in standby mode would send out nearly 5,000 pps of IPv6 multicast traffic – not much, I know, but when just one additional machine hit the net the traffic amplified: two stations would then generate 10,000 pps each, so it would only take 100 on a campus to cripple 1 Gb, with 250k pps broadcasting on the unknowing switch gear.
On only a couple of sites out of the dozen, I found that MLD snooping on some of the newer gear would help segregate the traffic and restore some connectivity until the problem could be resolved with a newer NIC driver on each PC. (We halted deployment and re-imaged everything already deployed.)
BUT, on most of my sites I do not have this, so we are prone to v6 issues such as this until we can upgrade.
Note – if you’ve got a Fluke OptiView, it shows as ip-v6 icmp-other with nameB = All_Routers, and you’ll see your counts through the roof.
Also, MLD does not stop this issue, but it does help slow it down and keep it from crippling the subnet.
I am not up to date on v6 stuff, so forgive me for anything wrong. Just wanted to share, as we had this happen the last week of March 2014.
Thanks for the affirmation. I just encountered this scenario, with the data in a packet capture, and wasn’t sure what I was dealing with yet. Unfortunately my 4500 access switch does not have a SUP capable of storm-control multicast. But at least I can ask the support techs to look at changing the driver.
Thanks for sharing this info; we have been dealing with this exact issue this summer. High CPU on our network devices was the first symptom, a packet capture showing tens of thousands of MLDs was confirmation, and updating the Intel NIC drivers fixed the issue.
BTW, the fastest way to identify the offending hosts with Wireshark is Analyze->Endpoints. You will quickly see who is sending all those packets!
Isn’t CoPP effective against this issue?
I had the exact same thing happen on New Year’s Eve 2014. What a way to end the year. Spent hours troubleshooting this. Our core switch, a Catalyst 4510, couldn’t process routing traffic. EIGRP neighbors couldn’t stay up, causing remote sites to be offline.
Solution… working with Cisco techs, we put in place an IPv6 policy map to drop IPv6 traffic from the VLAN on which the PCs are located. Since we are not using IPv6 in our environment, we do not need the IPv6 multicast causing problems.
Desktop team will be working on the HP PCs to deal with the driver problem.
Eugene, I helped another organization deal with this problem around that same time! For them, it was Lenovo desktops with the I217 NIC. This has been one of the most widespread and disruptive network issues I have seen in my 10 years as a network engineer.
I had the same thing on an Avaya network. The CPU was not maxed out, but the uplink ports were flapping up/down like yours. We have deactivated IPv6.
We had, and still have, a multicast threshold level of 4000 packets, but it did not trip to close the port.
Just had something similar with new HP elitedesk workstations fitted with the Intel I217-LM cards. Thanks for your info, was able to speedily attempt a fix at the problem.
Thanks for the feedback. I’m glad that I was able to help minimize the impact. Let’s keep sharing war stories!
Same thing with HP ProDesk 600 SFF! Sleep mode generates enormous traffic on all connected networks, even across different VLANs!
The problem is still not fixed with drivers; try disabling IPv6! Thanks for the post!
We had a similar problem with Dell 9020 machines. We forced out the latest A12 BIOS and the new NIC driver after the combination was confirmed to be a fix during our testing. This instantly resolved our problem. Luckily, monitoring caught this before it became too disruptive. I was surprised to find out how widespread this was. IPv6 remains enabled for the moment, but we may look at disabling it since it isn’t being used.
Hi friends, I am a network administrator at a company hosting about 2k end-user PCs, and I had the same problem causing constant 100% CPU usage on my LAN access switches. No one can guarantee that other new PCs, or a new PC network driver, won’t have the same problem, so to fix this and avoid network outages it is necessary to cut this traffic by enabling storm control on end-user ports on the access switches, because the multicast destination addresses aren’t learned by switches and these packets are flooded out all ports like broadcast messages. My configuration was 70 pps for multicast and broadcast; this fixed the problem without waiting on another manufacturer’s problem with network cards.
Hi Brian,
Could you please provide more information on how the access ports were configured to overcome this issue, as I have set ICMPv6 to 250 kbps, bcast to 250 kbps, and mcast to 10%, and I still have the same problem.
Many Thanks,
Ravi
We just had this happen this past week. We added about 15 PCs with this NIC. Two days later, our network was down in the early AM due to 100% CPU loads on our Brocade layer 3 switches.
We couldn’t figure out the issue for the life of us. After doing some raw data dumps on the Brocades while they were only at 50% usage, we saw the IPv6 ICMP multicast flooding.
Googled it and found this article… bam. Our exact issue. Once more than 6 PCs were asleep and flooding, the switch would become overwhelmed and start dropping processes.
Edited the GPO to stop sleep/hibernate and updated each PC to the newest Intel driver. Issue gone.
What a pain…..thanks Intel…LOL.
Hi,
We also encountered this kind of issue here at our company. Our network consists of more than 100 switches, and it took us more than a day to troubleshoot. The problem drove our distribution switches to high utilization while not affecting our core switch. To isolate the problem, we checked using the debug command show platform cpu packet buffered:
Index 0:
1313 days 3:22:19:213345 – RxVlan: 508, RxPort: Gi2/21
Priority: Normal, Tag: Dot1Q Tag, Event: 21, Flags: 0x40, Size: 90
Eth: Src 8C:DC:D4:31:54:C7 Dst 33:33:FF:2D:54:D4 Type/Len 0x86DD
Index 1:
1313 days 3:22:19:213524 – RxVlan: 508, RxPort: Gi2/21
Priority: Normal, Tag: Dot1Q Tag, Event: 21, Flags: 0x40, Size: 90
Eth: Src 64:51:06:5A:2C:A5 Dst 33:33:FF:2D:54:D4 Type/Len 0x86DD
We noticed all the distribution Catalyst 4500 switches had the same logs, so we traced the source MACs and shut down the ports, and everything went back to normal.
Upon searching for the cause of this mayhem, I found out that the destination 33:33:FF:2D:54:D4 is IPv6 multicast traffic; frames with MAC addresses in the range 3333.xxxx.xxxx are punted to the CPU.
Cisco recommends the following:
1) Disable generation of IPv6 Multicast Listener Discovery traffic on the end hosts. This can be done by upgrading NIC drivers or disabling the feature in the BIOS of hosts sending IPv6 packets. You can contact your client machine’s vendor, who can help disable the feature in the BIOS or upgrade the NIC drivers.
2) Enable Control Plane Policing (CoPP) in order to drop the excessive IPv6 Multicast Listener Discovery traffic being punted to the CPU. These packets are link-local with a hop limit of one, so it is expected behavior that they are punted to the CPU.
ipv6 access-list IPv6-Block
permit ipv6 any any
!
class-map TEST
match access-group name IPv6-Block
!
policy-map ipv6
class TEST
police 32000 conform-action drop exceed-action drop
!
control-plane
service-policy input ipv6
In the above example, both the conform and exceed actions are drop, so effectively all IPv6 traffic punted to the CPU is discarded.
Upgrading the drivers in the OS does not work. We thought we had finally nailed this (several months of no IPv6 storming), but recently 3 of our machines with upgraded drivers are again storming after a certain amount of sleep time. I don’t believe a GPO will fix this indefinitely. The kicker here is that this is definitely not happening at the OS level. We have turned these machines all the way off (WOL disabled) and they continue to storm IPv6 “listener” packets. My colleague noticed that some of the Intel features in the BIOS (Intel vPro) that we **purposely disabled** (because they were suspicious to us) are now re-enabled for some reason. Not sure if this is happening in the Intel Management Engine software or if it’s something as dumb as an MS update. In any case, this problem continues to haunt us.
Del (and anyone else),
We are seeing this with HP workstations and in every case they are behind a Cisco 79xx series phone. This could be due to who this model is being distributed to in our organization or it could be an interaction. I wonder if anyone else is noticing a phone correlation?
If anyone has specific BIOS, OS, or driver changes for HP workstations that mitigate this please share.
So far I am seeing people post about disabling IPv6 in OS settings, disabling hibernation, updating NIC drivers (questionable success), and toggling BIOS and NIC driver options.
We have so far been able to mitigate with:
storm-control multicast level 0.01
on Cisco Cat 3xxx interfaces where these hosts are connected. Waking them up also seems to make the traffic go away.
Hello,
I can also see a phone (Cisco) correlation! We have Cisco SPA504G phones. We also have another brand which I don’t remember the name of.
We had about 5-6 PCs with the Intel Ethernet I217-LM NIC. They were causing this mcast storm. The phones (and two APs) could not handle it and then died/lost their IPs/never requested new DHCP leases. The Cisco phones are only 100 Mbps full duplex; the other phones are 1 Gbps.
I could see about 25000 packets per second on router LAN interface. To me, it seems like the phones could not handle the amount of packets per second.
I also had another site where the pay station and a printer died. The printer was an old one with only 10 Mbps; the pay station (Visa cards etc.) was only 100 Mbps. This site also had the Intel I217-LM NIC.
I fixed this with HP-PROCURVE-SWITCH(eth-35)# rate-limit mcast in kbps 1000 until our Windows guy installs the new driver.
Lenovo is having the issue with the same NIC… an Intel problem… We were able to pick up the traffic in Wireshark and then trace it back to the NICs… good luck. You can also find similar information here:
https://support.lenovo.com/us/en/documents/ht082464
We have the phone correlation with Mitel phones. The phones would cut out and then come back, very random and sporadic. In our call centers we have real-time monitors and email queue monitors, and they were continually kicking out. The thing was, when the employees would go on break or lunch, that is when our problems began… It would happen once a day and then stop for 3 days; it was so weird until we got the Wireshark capture.
Seems this is the consistent issue:
Intel Ethernet Connection I217-LM Ethernet driver version 12.11.76.0, or below, installed.
I hope this helps.
Also experiencing the same issue at one of our remote head offices…
I have the following deployed on the Ciscos:
interface GigabitEthernet1/0/30
switchport block multicast
storm-control broadcast level bps 1m
storm-control multicast level bps 1m
storm-control action trap
spanning-tree portfast
spanning-tree bpduguard enable
spanning-tree guard loop
On the HP’s
interface 42
flow-control
broadcast-limit 1
We have switched off IPv6 in Windows and on all the printers…
This IPv6 multicast storm brings down the whole HQ…
We’ve just had an incident with this today.
In our case, the switch fabric was okay, but a budget LAN card in a Cisco router drove the router to 98% CPU. We traced it back to 4 machines in sleep mode.
It’s interesting that when you wrote this article, Intel had a driver fix in the works. It’s been over 2.5 years since then and the problem is obviously not fixed yet.