The project I am working on has a supercluster. We have about 10 subscribers, 1 publisher, and 2 TFTP servers. Not bad, huh! For more than a year we have been haunted by this DB issue on and off, and we have spent countless hours, weeks, and weekends on calls with Cisco trying to find the root cause and fix it.
Well, we were able to fix the database temporarily, but we were never lucky enough to find a valid root cause. I will try to tell the story from the beginning at some point, but for now we will just have to stick with what happened next.
Below is the email copied from the TAC findings. Names have been changed for security reasons.
Problem Description:
Phones are in rejected state when they are failed over to secondary node
Action Plan:
As discussed, please find the summary of the Webex we had:
++ If the phone is associated to the device pool with only secondary node, it registers fine
++ If the phone is associated to the device pool with primary and secondary node, phone fails over fine with secondary
++ Issue is only with one particular device pool
++ Took CCM traces, App and Sys logs, pcaps from secondary node, TFTP logs
++ From pcaps, phone sends a register request and receives a 404 Not Found from the CUCM, as it is not present in the database
Warning: 399 XYZ-CUCMSUB-01 "Unable to find device/user in database"
++ With SQL query we can see that the phone is present in the database
++ Checked replication, it's fine
++ From CCM traces, it shows that phone is in DB but it cannot find that it is a member of CM group of the same device pool
81068379.007 |15:08:57.507 |AppInfo |Device=SEP123456789 in DB already but cannot register. isDeviceNameAllowedToRegister=CallManager Pkid(3a5-880) is not a member of Call Manager Group(PPP-CMG) (isCallManagerMemberOfDevicePool)
++ Created a new DP with all same subs in it
++ phone registers fine with secondary node if primary is down
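If you have to dig through a pile of CCM traces for this same symptom, a small script can pull out the affected device, node Pkid, and CM group from each matching line. This is only a minimal sketch: the regular expression is modeled on the single trace line quoted above, and the names `MembershipFailure` and `parse_membership_failure` are made up for illustration. Other CUCM versions may word the trace differently, so treat the pattern as an assumption to adjust.

```python
import re
from typing import NamedTuple, Optional


class MembershipFailure(NamedTuple):
    """Fields extracted from one 'not a member of Call Manager Group' trace line."""
    timestamp: str
    device: str
    node_pkid: str
    cm_group: str


# Assumption: the pattern mirrors the one trace line shown in the TAC email.
PATTERN = re.compile(
    r"\|(?P<timestamp>\d{2}:\d{2}:\d{2}\.\d{3})\s*\|AppInfo\s*\|"
    r"Device=(?P<device>\S+) in DB already but cannot register\."
    r".*?CallManager Pkid\((?P<pkid>[^)]*)\) is not a member of "
    r"Call Manager Group\((?P<group>[^)]*)\)"
)


def parse_membership_failure(line: str) -> Optional[MembershipFailure]:
    """Return the parsed fields, or None if the line is not this kind of error."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return MembershipFailure(
        timestamp=match.group("timestamp"),
        device=match.group("device"),
        node_pkid=match.group("pkid"),
        cm_group=match.group("group"),
    )


if __name__ == "__main__":
    sample = (
        "81068379.007 |15:08:57.507 |AppInfo |Device=SEP123456789 in DB already "
        "but cannot register. isDeviceNameAllowedToRegister=CallManager "
        "Pkid(3a5-880) is not a member of Call Manager Group(PPP-CMG) "
        "(isCallManagerMemberOfDevicePool)"
    )
    print(parse_membership_failure(sample))
```

Running it over every line of a collected trace file and counting results per `cm_group` is a quick way to confirm whether the rejections really are limited to one CM group or device pool.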
As per the observation, it looks like the above error was happening because the RIS Data Collector service had incorrect information about the phone that was trying to register (we were testing with only one). Even though the phones were not registered on the first subscriber in the group, RIS DC assumed that the phones were using the old node in the CCM group. This is why we see "SEP123456789" in DB already but cannot register.
After creating a new CM group, we noticed that the subscriber had an incorrect status for the phones, while the publisher was now showing the correct status. Since RIS DC interfaces the memory between CCM and Tomcat, it looked like Tomcat was not picking up the right entries from its memory, which RIS DC has to provide.
Action Plan –
++ Restart RIS DC, Tomcat and CCM service for that node
++ For further RCA, please collect detailed logs:
1) CUCM
2) TFTP
3) RIS