Monitoring process failures

When a process fails or restarts, the Availability Management Framework (AMF) module writes a log entry that describes why and how the process died. Process-failure logs fall into 3 main categories.

Each category has its own log-details format, which determines how to parse information from the log.

Every log entry, regardless of category, uses this general format:

Log entry format

TEXT

<DATE> <TIME> <hostname> <processid> Blue Cedar: [LOGCLASS], SubCls:<XYZ>, EID:          X, Type:   <type>, Sev:<severity>, <log details>

Each log can be parsed based on 3 items: the LOGCLASS, the SubCls, and the log details. The format of the log-details field is specific to each category. To parse it correctly, refer to the section for that category below.
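As an illustration, the generic format can be split into its fields with a regular expression. This is a minimal sketch, not an official parser: the pattern and the names `ENTRY_RE` and `parse_entry` are assumptions derived from the format line and the examples on this page, and the sketch assumes that SubCls is always numeric.

```python
import re

# Hypothetical pattern based on the format line and examples on this page.
# It anchors on the "Blue Cedar:" marker and ignores the syslog prefix
# (date, time, hostname) that precedes it.
ENTRY_RE = re.compile(
    r"Blue Cedar: \[(?P<logclass>[^\]]+)\], "
    r"SubCls:(?P<subcls>\d+), "
    r"EID:\s*(?P<eid>\S+), "
    r"Type:\s*(?P<type>[^,]+), "
    r"Sev:(?P<severity>[^,]+), "
    r"(?P<details>.*)",
    re.DOTALL,
)

def parse_entry(line):
    """Return the header fields of a log entry as a dict, or None if no match."""
    match = ENTRY_RE.search(line)
    return match.groupdict() if match else None

sample = (
    "Nov  4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], "
    "SubCls:013, EID:     0, Type:   Fault, Sev:Critical, "
    "Healthcheck timeout for key MAG_Watchdog#012Component: "
    "safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway"
)

entry = parse_entry(sample)
print(entry["logclass"], entry["subcls"], entry["severity"])
```

The `details` group holds the category-specific text, which can then be passed to a parser chosen by the `subcls` value.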

A process has died or was killed

When a process dies, AMF generates a log entry with these fields.

LOGCLASS: AMFAGENT
SubCls: 010
Log-details

Format of the log-details field (the last field of the generic log entry format above):

Log details example

TEXT

Hard Error#012Component: <component>#012Reporting Process: <process>#012Attributes: <attributes>#012Description: <description>#012Details: <details>

The fields are described in the table below, followed by an example.

Field	Description
Component	A string containing internal information about the "component" (or process) that has died. Check "safComp" for the name of the process. Example: `safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway`
Reporting Process	The internal process which is reporting the failure. This can usually be ignored.
Attributes	Attributes describing why the process died (usually a Linux signal).
Description	Why the process died. This may also indicate the Linux signal number which caused the process to die.
Details	A string containing the processName of the process which has died and information about the process IDs.

Example:

CODE

Nov  4 15:26:13 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:010, EID:          0, Type:   Fault, Sev:Critical, Hard Error#012Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway#012Reporting Process: elemProgramMgr#012Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated#012Description: Linux application died due to signal 15 (Terminated)#012Details: processName:aaa  spid:10071  pid:3013

To make the above example clearer, here it is with each "#012" replaced by a newline (\n):

CODE

Nov  4 15:26:13 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:010, EID:          0, Type:   Fault, Sev:Critical, Hard Error
Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
Reporting Process: elemProgramMgr
Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated
Description: Linux application died due to signal 15 (Terminated)
Details: processName:aaa  spid:10071  pid:3013
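The log-details string for this category can be broken into labelled fields by splitting on "#012". The following is a minimal sketch under that assumption. The helper names `parse_hard_error`, `component_name`, and `signal_number` are hypothetical and not part of the product, and the signal lookup only works when the Description field actually mentions a signal number.

```python
import re

def parse_hard_error(details):
    """Split a SubCls 010 log-details string into its labelled fields."""
    lines = details.replace("#012", "\n").split("\n")
    fields = {"Summary": lines[0]}  # first line, e.g. "Hard Error"
    for line in lines[1:]:
        name, sep, value = line.partition(": ")
        if sep:  # skip lines that are not "Name: value" pairs
            fields[name] = value
    return fields

def component_name(component):
    """Return the safComp value from a Component string, or None if absent."""
    for pair in component.split(","):
        key, _, value = pair.partition("=")
        if key == "safComp":
            return value
    return None

def signal_number(description):
    """Return the Linux signal number mentioned in a Description, if any."""
    match = re.search(r"signal (\d+)", description)
    return int(match.group(1)) if match else None

details = (
    "Hard Error#012Component: safComp=aaa,safSu=SU_1,"
    "safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway"
    "#012Reporting Process: elemProgramMgr"
    "#012Attributes: type=SW,class=LINUX_SIGNAL,subclass=Terminated"
    "#012Description: Linux application died due to signal 15 (Terminated)"
    "#012Details: processName:aaa  spid:10071  pid:3013"
)

fields = parse_hard_error(details)
print(component_name(fields["Component"]))   # aaa
print(signal_number(fields["Description"]))  # 15
```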

AMF terminates a process

Each process responds to keepalive events sent by the AMF modules. If a process becomes unresponsive, AMF restarts it. Under the default keepalive policy, a keepalive is sent every 60 seconds and the process must reply within 5 minutes. This policy cannot be modified. When AMF restarts the process, the log entry includes these fields.

LOGCLASS: AMFAGENT
SubCls: 013
Log-details

Format of the log-details field (the last field of the generic log entry format above):

Log details example

TEXT

Healthcheck timeout for key <key>#012Component: <component>

Field	Description
Key	Internal Key ID. Usually `MAG_Watchdog`
Component	A string containing internal information about the "component" (or process) that failed the health check. Check "safComp" for the name of the process. Example: `safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway`

Example:

TEXT

Nov  4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:013, EID:     0, Type:   Fault, Sev:Critical, Healthcheck timeout for key MAG_Watchdog#012Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway

To make the above example clearer, here it is with each "#012" replaced by a newline (\n):

TEXT

Nov  4 16:01:28 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:013, EID:     0, Type:   Fault, Sev:Critical, Healthcheck timeout for key MAG_Watchdog
Component: safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway
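The key and component can be pulled out of this log-details string with a single pattern. This is a sketch only: `HEALTHCHECK_RE` and `parse_healthcheck_timeout` are hypothetical names, and the pattern assumes the key contains no whitespace. It accepts either the raw "#012" separator or a real newline.

```python
import re

# Hypothetical pattern for SubCls 013 log details, based on the format above.
HEALTHCHECK_RE = re.compile(
    r"Healthcheck timeout for key (?P<key>\S+?)(?:#012|\n)"
    r"Component: (?P<component>\S+)"
)

def parse_healthcheck_timeout(details):
    """Return the key and component as a dict, or None if the text does not match."""
    match = HEALTHCHECK_RE.search(details)
    return match.groupdict() if match else None

details = (
    "Healthcheck timeout for key MAG_Watchdog#012Component: "
    "safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway"
)

timeout = parse_healthcheck_timeout(details)
print(timeout["key"])
print(timeout["component"])
```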

A process is restarted too many times within the restart policy

The Connect Gateway has a default restart policy that tracks how many times a process restarts within a period of time. Under the default policy, if a process restarts 3 times within 20 seconds, that process may not be started again. This policy cannot be modified. When the limit is reached, the log entry includes these fields.

LOGCLASS: AMFAGENT
SubCls: 999
Log-details

Format of the log-details field (the last field of the generic log entry format above):

Log details example

TEXT

CORE : Component <component> has restarted <number> times within component probation period of <period> ms, restart SU

Field	Description
Component	A string containing internal information about the "component" (or process) that has restarted. Check "safComp" for the name of the process. Example: `safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway`
Number	The number of times the process has restarted. Under the default restart policy, the limit is 3. This cannot be modified.
Period	The probation period, in milliseconds. Under the default restart policy, the period is 20 seconds (20000 ms). This cannot be modified.

Example:

TEXT

Nov  4 15:26:29 bluecedar-atlas journal: Blue Cedar: [AMFAGENT], SubCls:999, EID:          0, Type:  Config, Sev:Major, CORE : Component safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,safApp=Bluecedar_Gateway has restarted 3 times within component probation period of 20000 ms, restart SU
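The component, restart count, and probation period can be extracted from this log-details string as shown in the sketch below. `RESTART_RE` and `parse_restart_limit` are hypothetical names, and the pattern is an assumption based only on the format and example above.

```python
import re

# Hypothetical pattern for SubCls 999 log details, based on the format above.
RESTART_RE = re.compile(
    r"Component (?P<component>\S+) has restarted (?P<number>\d+) times "
    r"within component probation period of (?P<period_ms>\d+) ms"
)

def parse_restart_limit(details):
    """Return component, restart count, and period (ms), or None if no match."""
    match = RESTART_RE.search(details)
    if match is None:
        return None
    return {
        "component": match.group("component"),
        "number": int(match.group("number")),
        "period_ms": int(match.group("period_ms")),
    }

details = (
    "CORE : Component safComp=aaa,safSu=SU_1,safSg=SG_MAG_NON_Redundant_1,"
    "safApp=Bluecedar_Gateway has restarted 3 times within component "
    "probation period of 20000 ms, restart SU"
)

restart = parse_restart_limit(details)
print(restart["number"], restart["period_ms"])  # 3 20000
```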