Analysis of System Safety
Tor Stålhane IDI / NTNU
Program
Time | Comments
11:15 - 12:00 | On risk, reliability, safety and security. Preliminary Hazard Analysis (PHA): general introduction. HazOp: general introduction. Failure Mode and Effect Analysis (FMEA): general introduction
12:00 - 13:15 | Lunch
13:15 - 14:00 | Fault Tree Analysis (FTA): general introduction. PHA: example from EPJ (electronic patient journal). Exercise 1: PHA on a train control system
14:15 - 15:00 | HazOp: example from EPJ. Exercise 2: HazOp on a train control system
15:15 - 16:00 | FMEA: example. FTA: example. Exercise 3: FMEA on a component of the train control system
16:15 - 17:00 | Summary and discussion
The Happy Scenarios
When making scenarios for a new system, the focus tends to be on success: the requirements for the happy scenarios. Our main concern is to
- Identify hazards
- Find out how to prevent hazards
- Convert the prevention strategy into requirements.
This means that we have to look beyond the happy scenarios.
The Dark Side
Developers have always been primarily concerned with solving the customer's problems. What is all too often forgotten is that our solutions can introduce new problems: the dark side of the system, as opposed to the happy scenarios. We will focus on the dark side of systems development.
Reliability and Safety - 1
It is important to keep three concepts separated:
- Reliability: does the system do what the requirements say it shall do?
- Safety: will the system refrain from hurting people or destroying equipment and environment?
- Security: will the system prevent unauthorized access?
We will focus on safety and security.
Reliability and Safety - 2
We cannot, however, ignore reliability. The question of reliability is important when we want to insert a barrier against a hazard. We need to know the probability that the
- Hazard will occur
- Barrier will work when needed (its reliability)
The Safety Perspective
We need tools and methods to
- Identify what can go wrong: the hazards
- Translate the hazards into requirements that will help us to avoid the hazards
- Develop tests that check that the hazard-avoidance requirements are correctly implemented
Important Methods - 1
Methods used in hazard analysis should:
- Be simple so that all types of personnel can participate
- Be transparent in order to give confidence
- Allow inclusion of hardware, software and operators
Important Methods - 2
The method we use must have a process: a way of doing the job. This is important in order to be able to
- Teach the method
- Improve the method as we get more experience (process improvement).
Important Methods - 3
- PHA and HazOp: what can go wrong and what are the consequences?
- Fault Tree Analysis: how can it go wrong?
- Failure rate budgeting: what will be the reliability requirements for each component?
- Failure Mode and Effect Analysis (FMEA): what are the consequences of a component failure?
How to identify Hazards - 1
Our ability to identify hazards and the method to be used will depend on the amount of information available at the point of analysis. We will discuss situations where we have:
- A concept or an idea for a system.
- A set of stories.
- A set of use cases, be it diagrams or textual descriptions.
How to identify Hazards - 2
Irrespective of method we need:
- Information about the environment: where will the system be used?
- People with knowledge of and experience with the application domain: the stakeholders.
How to identify Hazards - 3
All methods for hazard identification are really structured brainstorming processes.
- The stakeholders supply the knowledge and experience needed.
- The method supplies the structure necessary to use the experience and knowledge available.
Process results
The hazard analysis must give us info that can be used to write
- Changes and additions to the requirements in order to prevent hazards
- New tests for the new requirements. Requirements without tests are mostly ignored.
Preliminary Hazard Analysis - PHA
The PHA - 1 The preliminary hazard analysis – PHA - has only one structuring device – the PHA table - and should only be used early in the process. This is reflected in the level of detail used in the table. It is important to include both effects of the hazards and the corresponding preventive actions. This is important since we want to identify barriers and tests.
From Concept to Hazards
A concept, as used here, is an idea for a system: somebody saying something like ”We need a system to store and manage all our patient journals”. Somebody says ”Yeah, that’s great, let’s get started!” Our mission is to add the dark side perspective by asking ”How can this create a hazard?”
The PHA table
Hazard | Cause | Main effect | Preventive action
Wrong diagnosis retrieved | Wrong diagnosis inserted | Kill or hurt patient | Double check all patient info
:
Preventive actions indicate both barriers and tests.
PHA results
At this stage we have no code, not even an architecture. Thus, the preventive actions are at a rather high level, for instance plain text. The tests are just descriptions such as ”Check that A occurs in state B” or ”Check that preventive action C prevents problem D in state E”.
Hazard Analysis - HazOp
Process input
We can do a HazOp based on one of the following:
- The functions: functional HazOp. Focus is on “How can this function fail?”
- The system’s structure: study nodes. Focus is on “How can we get problems at this point in the system?”
The HazOp The HazOp has two structuring devices – the table and the guide words. This makes it more efficient in identifying hazards. However, it also requires more information – the structure of the system, used to identify the study nodes.
The HazOp Table - simple
Guide word | Study node | Consequences | Causes | Possible solutions
This is a simple version. More elaborate versions give more info and require more work.
The HazOp Table - advanced
Channel / Event | Guideword | Failure Condition (Hazard Description) | Phase | Effect of Failure Condition | Classification | Reference | Verification
Study Nodes
A study node is a point in the system where we want to focus. Usually it is a point where
- The system interacts with its environment, e.g. user input.
- Two or more parts of the system exchange information, e.g. a network connection.
The HazOp Guidewords - 1 The guidewords are used to focus the attention of the participants. The standard guidewords are related to production processes, e.g. none, more, less, as well as. Each guideword is combined with each study node, e.g. “none” and “patient id”
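The pairing of guidewords and study nodes can be sketched in Python as follows. This is only an illustration: the function name `hazop_prompts` and the worksheet field names are assumptions, and the study nodes are picked from examples elsewhere in these slides.

```python
from itertools import product

# Guidewords quoted on the slide; study nodes taken from the slide examples.
GUIDEWORDS = ["none", "more", "less", "as well as"]
STUDY_NODES = ["patient id", "medication info input"]


def hazop_prompts(guidewords, study_nodes):
    """Return one empty worksheet row per (guideword, study node) pair."""
    return [
        {
            "guideword": gw,
            "study_node": node,
            "consequences": "",
            "causes": "",
            "possible_solutions": "",
        }
        for node, gw in product(study_nodes, guidewords)
    ]


rows = hazop_prompts(GUIDEWORDS, STUDY_NODES)
for row in rows:
    print(f'{row["guideword"]!r} applied to {row["study_node"]!r}')
```

Each generated row is a prompt for the group discussion. The participants fill in consequences, causes and possible solutions, or mark the combination as not meaningful.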
The HazOp Guidewords - 2
The lack of software-related guidewords can be dealt with in at least two ways:
- Make new, software-related guidewords
- Give new, software-related meaning to the original guidewords. In addition, add guidewords for timing.
The second solution has turned out to be the best one.
New guidewords
Standard guidewords - 1
Examples of standard guidewords:
- Too much
- Too little
Interpretation for a special application:
- Too much: value above upper limit
- Too little: value below lower limit
HazOp with standard guidewords
Standard guidewords:
- Less
- Part of
Interpreted guidewords for a software application. In principle, a new interpretation is made for each new system:
- Arrival of signal: less -> sporadic
- Content of signal: less -> incomplete
Standard guidewords - 2
The software-related interpretation of the guidewords is done through a set of group discussions. For each guideword GW, the discussion starts with a question: what does this GW mean
- In our system?
- For this study node?
Guidewords - discussion - 1
Special guidewords:
- Advantages: adapted to software systems; quick to get started
- Disadvantages: need not fit this application well
Guidewords - discussion - 2
Interpreted guidewords:
- Advantages: will always fit this application well; the interpretation process can give important new understanding and input to the analysis process
- Disadvantages: extra work before we can get started, although reuse is possible
From Hazards to Requirements The movement from hazards to requirements starts with a question: ”Now that we know what can go wrong, how can we prevent it?” The start of the answer is found in the HazOp table - the failure condition. If we can prevent this condition, we can prevent the problem from occurring.
Sources of Barriers
The output from the analysis indicates the barriers needed.
- PHA: ”Preventive action” tells us what to do to prevent the hazard, i.e. a barrier.
- HazOp: ”Failure condition” or ”Effect of failure condition” gives us two opportunities for inserting barriers.
HazOp
At this stage, we have specified the
- Requirements of the subsystems.
- Algorithms to be used to solve each problem.
In object-oriented development we have identified
- The most important classes
- Their most important attributes and methods.
Failure Mode and Effect Analysis - FMEA
FMEA - 1
FMEA is a method for systematically checking each system component:
- How can the component fail?
- What are the consequences for the component?
- What are the consequences for the system?
- How can we handle the dangerous event?
FMEA - 2
Class / Method | Failure mode | Failure effect | Action or barrier | Severity
Credit rating of customer | Credit rating is too high | Customers can order more than their credit covers | 1) Manual check when setting or changing the credit rating 2) Implement a function that receives credit ratings from external sources | High
 | Credit rating is too low | The customer is not allowed to use their maximum credit | | Medium
FMEA - 3
The FMEA method:
- Offers a systematic walk-through of one or more system components.
- Focuses on prevention (barriers) rather than cures and fixes.
- Produces an easy-to-use list of dangers and ideas on how they can be removed or handled.
Fault Tree Analysis - FTA
What is a fault tree
A fault tree is a logic diagram that illustrates the relationship between an undesired event (hazard) in a system and the causes of this event. The advantage of fault trees is that those who perform the analysis are forced to understand the system thoroughly. Many weaknesses can therefore be uncovered already while the fault tree is being developed.
When should we use fault tree analysis - 1
Fault tree analysis requires a lot of effort and should be used with care. Therefore we should
- Use fault tree analysis to analyse important failure modes
- Not use fault tree analysis to analyse every irritating failure mode
When should we use fault tree analysis - 2
Two possible purposes:
- How probable is it that this failure occurs?
- How should we distribute the responsibility for reliability in this system?
Elements of a fault tree
- Logic gates
  - OR gate
  - AND gate
  - Condition gate: takes system states into account
- Events
  - Primary event
  - Secondary event
- Transfers
- Comments
Developing / constructing fault trees
1. Specify an undesired event (hazard) and let it be the top event of the fault tree.
2. Analyse the system to find the events that can be direct causes of this event, and connect them with a logic gate.
3. Repeat the previous step for all events until every event has been decomposed into primary events.
Use of AND and OR gates
[Figure: fault tree with one AND gate (.) and one OR gate (+). Events: Fire breaks out, Leakage of flammable fluid, Ignition source is near the fluid, Employee is smoking, Spark exists.]
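A small Python sketch of how AND and OR gates combine events, using the fire example. The gate structure is an assumption (the slide text does not preserve it): fire requires both a leakage and an ignition source, and the ignition source is either a smoking employee or a spark. The names `Gate` and `occurs` are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Gate:
    """An intermediate event whose occurrence depends on its inputs."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", str]] = field(default_factory=list)


def occurs(node, primary_events: Dict[str, bool]) -> bool:
    """Evaluate whether an event occurs, given the state of the primary events."""
    if isinstance(node, str):  # a primary event (leaf of the tree)
        return primary_events[node]
    results = [occurs(child, primary_events) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Assumed structure of the fire example.
ignition = Gate("Ignition source is near the fluid", "OR",
                ["Employee is smoking", "Spark exists"])
fire = Gate("Fire breaks out", "AND",
            ["Leakage of flammable fluid", ignition])

state = {"Leakage of flammable fluid": True,
         "Employee is smoking": False,
         "Spark exists": True}
print(occurs(fire, state))
```

With this structure, removing either the leakage or every ignition source is enough to prevent the top event, which is the kind of observation that points to where a barrier can be placed.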
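Condition gate
[Figure: two fault trees for the event "Operator fails to shut down system". One has the single cause "Operator pushes wrong switch when alarm sounds". The other uses a condition gate with the event "Operator pushes wrong switch" and the condition "Alarm sounds".]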
Fault Tree Example - 1
[Figure: fault tree. Nodes: Hazard, Wrong code, Network error, Data destroyed, Wrong table inserted, Changes outside mtns, Transport error, Transponder error, Decode error.]
Fault Tree Example - 2
[Figure: fault tree. Nodes: Wrong table inserted, Wrong table inserted - A, Wrong table inserted - B, Table change while not in maintenance mode, HW switch failure, SW switch failure, Switch SW error, Switch stuck, Wrong info from switch, Wrong data inserted, Faulty manual check.]
Fault Tree Analysis - 2
Based on a software fault tree we can
- Compute reliability requirements for a component, in order to set requirements on the methods and techniques used during development
- Identify the need for barriers realized in software, hardware or operational procedures
Failure Rate Budgeting
Failure rate budgeting starts with the
- Total acceptable system failure rate, e.g. from the assigned SIL number
- List of identified hazards from the HazOp.
Each hazard is assigned to a subsystem or component.
Three forms of failure budgeting
- Start from the acceptable risk $R$. Find the consequence $K_i$ of each individual failure. The probability $p_i$ of each individual failure must be less than $R / \sum_j K_j$.
- Find the acceptable failure rate for the system. Assign an allowed failure rate to each component via failure budgeting.
- Qualitative analysis: mostly used only to identify the need for barriers.
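The first form of budgeting can be sketched in Python. All numbers, failure names and the function name `max_failure_probability` are hypothetical and serve only to show the arithmetic: if every failure probability stays below the bound, the summed risk stays below the acceptable risk.

```python
def max_failure_probability(acceptable_risk, consequences):
    """Upper bound on each failure's probability: R divided by the sum of all K_i."""
    total = sum(consequences.values())
    if total <= 0:
        raise ValueError("consequences must sum to a positive value")
    return acceptable_risk / total


# Hypothetical numbers: acceptable risk R and one consequence K_i per failure,
# both expressed in the same (arbitrary) cost unit.
R = 10.0
K = {
    "wrong diagnosis retrieved": 5000.0,
    "medication info too late": 3000.0,
    "record unavailable": 2000.0,
}

bound = max_failure_probability(R, K)
# Risk if every failure occurs with exactly the bounding probability.
worst_case_risk = sum(bound * k for k in K.values())
print(f"bound per failure: {bound:.4g}, worst-case risk: {worst_case_risk:.4g}")
```

The bound is the same for all failures, so it is conservative for failures with small consequences. That is a property of this simple rule, not of the sketch.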
Failure budgeting
Simple failure budgeting, based on fault trees where the number of events into a gate is $N$ and the failure rate budget for the gate output is $\lambda$:
- For an OR gate: give each branch a budget of $\lambda / N$.
- For an AND gate: give each branch a budget of $\lambda^{1/N}$.
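System failure
[Figure: fault tree with budgeted failure rates. Nodes: System failure, Subsys A, Network, Subsys B, Y1, Y2, X1, X2. Values shown: $10^{-5}$, $3 \cdot 10^{-6}$, $1.5 \cdot 10^{-6}$, $1.7 \cdot 10^{-3}$.]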
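A minimal Python sketch of the budgeting rule, applied to a tree shaped like the one in the figure. The gate types are an assumption: System failure and Subsys A are taken as OR gates and Subsys B as an AND gate, which is consistent with the values shown. The function name `budget` and the tuple encoding of the tree are illustrative. The computed values differ slightly from the figure, which appears to round the second-level budget to 3e-6 before budgeting further.

```python
def budget(node, rate, out=None):
    """Distribute the failure-rate budget `rate` down a fault tree.

    A node is either a string (a leaf) or a tuple (name, kind, children),
    where kind is "OR" or "AND".
    """
    if out is None:
        out = {}
    if isinstance(node, str):
        out[node] = rate
        return out
    name, kind, children = node
    out[name] = rate
    n = len(children)
    if kind == "OR":
        share = rate / n            # OR gate: lambda / N per branch
    elif kind == "AND":
        share = rate ** (1.0 / n)   # AND gate: lambda ** (1/N) per branch
    else:
        raise ValueError(f"unknown gate kind: {kind}")
    for child in children:
        budget(child, share, out)
    return out


# Assumed gate types; node names follow the figure.
tree = ("System failure", "OR", [
    ("Subsys A", "OR", ["Y1", "Y2"]),
    "Network",
    ("Subsys B", "AND", ["X1", "X2"]),
])

rates = budget(tree, 1e-5)
for name, rate in rates.items():
    print(f"{name}: {rate:.2e}")
```

The AND branch gets a much looser budget per component than the OR branches, because both X1 and X2 must fail for Subsys B to fail. The rule treats the inputs of an AND gate as independent.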
Qualitative ”calculations”
Let {Input} be the set of all inputs to a gate. Then:
- For an AND gate: Output = min({Input})
- For an OR gate: Output = max({Input})
where max(H, L) = H and min(H, L) = L.
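The min/max rule can be written as a few lines of Python. The two-level scale L < H comes from the slide; the dictionary encoding and the function name `gate_output` are illustrative assumptions.

```python
LEVELS = {"L": 0, "H": 1}  # ordering from the slide: L (low) is below H (high)


def gate_output(kind, inputs):
    """Qualitative output of a gate: min of the inputs for AND, max for OR."""
    pick = {"AND": min, "OR": max}[kind]
    return pick(inputs, key=LEVELS.__getitem__)


print(gate_output("AND", ["H", "L"]))
print(gate_output("OR", ["H", "L"]))
```

More levels, for instance a medium level M, can be added by extending `LEVELS`, as long as the ordering is kept.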
System failure
[Figure: fault tree with qualitative values. Nodes: System failure, Subsys A, Network, Subsys B, Y1, Y2, X1, X2. Values shown: L, H.]
Iterations
The failure budget sets requirements on the failure rate of each individual component.
- If the requirements are realisable and testable, we are done with this level of the analysis.
- Otherwise we must insert barriers at this or a higher level in the tree.
Barriers
Barriers are extra functions that are inserted in order to
- Prevent failures
- Reduce the consequences of a failure
- Raise an alarm if a failure is about to occur
Barriers - example
[Figure: fault tree fragment with the nodes Algorithm and Plausibility check and the values $10^{-6}$ and $10^{-3}$.]
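One plausible reading of the example, stated here as an assumption since the figure itself is not preserved: the algorithm alone fails with probability $10^{-3}$, the plausibility check misses a failure with probability $10^{-3}$, and the two fail independently. A wrong result then gets through only if both fail:

```latex
P(\text{undetected failure})
  = P(\text{algorithm fails}) \cdot P(\text{check fails})
  = 10^{-3} \cdot 10^{-3}
  = 10^{-6}
```

Under this reading the barrier lets a component that can only be shown to meet $10^{-3}$ satisfy a requirement of $10^{-6}$, provided the independence assumption holds.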
HazOp and experience
The Challenge of HazOp
You cannot find hazards that are not present in one or more of the participants’ heads when the process starts. The method generates no new knowledge, and the results are thus critically dependent on the participants’ knowledge and experience.
Our Solution
More experience will give a better HazOp. This can be achieved through reuse of experience. In order to make this work, we need processes for:
- Experience harvesting
- Experience reuse
Experience Harvesting
We need to register experiences related to the application domain. This includes:
- Accidents
- Near-accidents
- Unexpected behaviour
- Malfunctions
- Other problems
Experience Reuse - 1
Before performing a HazOp or PHA we should search our database for experiences related to the application domain and ask:
- Could this happen here?
- What would be the consequences?
- How can we safeguard against this?
Experience Reuse - 2
The effect of reuse is to augment the experience available during the HazOp or PHA. Instead of relying only on the experience and knowledge of the participants, we can also draw on the experience from earlier systems in the same and in related application domains.
Experience Reuse - 3
Reuse of a hazard can also imply reuse of
- Barriers: how did we prevent this from happening last time?
- Tests: how did we check that the barrier worked the last time?
Experience can be a major company asset.
Experience Reuse - 4
If we discover a hazard in a system that has already been analyzed, we need to do a Post Mortem Analysis (PMA). We need to find answers to the following questions:
- Why didn’t we find this hazard during HazOp?
- How can we improve so that we don’t repeat the mistake?
Safety case
A safety case is something in between FMEA, HazOp and FTA, but without using FTA’s strict notation. Safety cases are useful for discussing what we actually claim to fulfil in terms of safety requirements, presented in a systematic way.
What is a Safety Case
A documented body of evidence that provides a convincing and valid argument that a system is adequately safe for a given application in a given environment.
This method is a tool for managing safety claims:
- It contains a reasoned argument that a system is or will be safe.
- It is manifested as a collection of data, metadata and logical arguments.
Issues about safety worries are incorporated into safety case documents
- as problems to be solved
- as evidence used in existing safety claim arguments
Safety Case structure
- Claims about a property of the system or a subsystem (usually about safety requirements for the system)
- Evidence used as the basis for the safety argument (facts, assumptions, sub-claims)
- Arguments linking the evidence to the claim
- Inference rules for the argument
[Figure: a Claim supported by Argument A and Argument B, with Evidence A1, Evidence A2 and Evidence B1 below the arguments.]
Safety case and PHA
- PHA accommodates analysis of a project at an early phase, when we have only high-level information.
- A safety case supports the definition of a safety requirements specification by forcing the developers to argue that the intended system can be trusted.
Combining methods
[Figure: process diagram with the elements Customer, Environment, PHA and/or HazOp, Safety Requirements, Safety case.]
Much of the work done early in conjunction with safety cases tries to identify possible hazards and risks, for instance by using methods like Preliminary Hazard Analysis (PHA) and Hazard and Operability Analysis (HazOp). These are especially useful in combination with a safety case for identifying the risks and safety concerns that the safety case is going to handle.
By introducing the use of safety cases and PHA/HazOp into the RUP inception phase, we have a process where the system safety requirements are maintained in the safety case documents. PHA and HazOp studies on the system specification, together with its customer requirements and environment description, produce hazard identification logs that are incorporated into the safety case as issues to be handled. This also leads to revision of the safety requirements.
PHA example
Patient Journal Concept
[Figure: top level view that identifies the system and stakeholders: Primary Physician, Nurse, Physician, Lab system, Patient journal system.]
From Concept to Hazards - 1
Hazard | Cause | Main effect | Preventive action
Wrong diagnosis retrieved | Wrong diagnosis inserted | Kill or hurt patient | Double check all patient info inserted
 | Stored diagnosis info corrupted | | Double store and check
 | Wrong patient id used at insertion or retrieval | | Redundant patient info required for retrieval
:
From Concept to Hazards - 2
The results from the PHA can be used to write
- Requirements: “Double store and check”. Implement the necessary mechanisms.
- Tests: “Redundant patient info required for retrieval”. Test retrieval.
- Inputs to manual procedures: “Double check all patient info inserted”. Write a manual info insertion procedure.
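As an illustration of how a preventive action becomes both a barrier and a test, here is a minimal Python sketch of “Redundant patient info required for retrieval”. Everything in it is hypothetical: the in-memory `JOURNAL`, the choice of name and birth date as the redundant info, and the names `retrieve_diagnosis` and `PatientMismatch`.

```python
class PatientMismatch(Exception):
    """Raised when the redundant patient info does not match the stored record."""


# Hypothetical in-memory journal; a real system would use a database.
JOURNAL = {
    "12345": {"name": "Kari Nordmann", "birth_date": "1970-01-31",
              "diagnosis": "asthma"},
}


def retrieve_diagnosis(patient_id, name, birth_date, journal=JOURNAL):
    """Barrier: return the diagnosis only if id, name and birth date all agree."""
    record = journal.get(patient_id)
    if record is None:
        raise KeyError(f"unknown patient id: {patient_id}")
    if (record["name"], record["birth_date"]) != (name, birth_date):
        raise PatientMismatch("redundant patient info does not match")
    return record["diagnosis"]


def test_wrong_patient_id_is_rejected():
    """Test: check that the barrier blocks retrieval when the wrong id is used."""
    try:
        # The id belongs to another patient than the name and birth date given.
        retrieve_diagnosis("12345", "Ola Nordmann", "1980-05-17")
    except PatientMismatch:
        return True
    return False


print(test_wrong_patient_id_is_rejected())
```

The test follows the pattern ”Check that preventive action C prevents problem D in state E” from the PHA results: the barrier is the redundancy check, the problem is retrieval with a wrong patient id.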
Exercise - PHA
Concept: train control system
We want to obtain information about the positions of all trains so that we know where they are at all times. We will use this to
- Prevent collisions
- Give travellers information about arrival times, delays, etc.
HazOp example
Use Case for Patient Journal
[Figure: use case diagram. Actors: Physician, Primary Physician, Nurse, Lab system. Use cases: Medication, Diagnosis, Documents, Orders and responses, Treatment plan.]
HazOp for Medication
Guideword | Study node | Consequences | Causes | Possible solutions
Too late | Medication info input | Lack of medication | Info arrives too late | Physicians must be prompted for medication info
Less | | Wrong medication | Lack of info |
As well as | | | Conflicting info | Check all medication info for consistency
More | | | Extra, possibly outdated info | Medication info must be time stamped.
Exercise HazOp
[Figure: train control system. Components: Train, Balise, Transmitter / receiver (shown twice), Input, Position converter, Balise table.]
Safety Case example
Customer Information A business needs a database containing information about their customers and the customers’ credit information. An important requirement for such a system would be ensuring the correctness and validity of customers’ credit information. Any problems concerning this information in a system would seriously impact the company’s ability to operate satisfactorily.
PHA table Customer info management Danger Causes Effects Barriers Wrong address inserted Correspondence sent to wrong address Check against name and public info, e.g. “Yellow pages” Update error Database error Testing Wrong credit info Wrong billing. Can have serious consequences Manual check required Consistency check
must be correct when sending invoice Safety Case Credit info DB is sufficiently reliable Credit info must be correct when sending invoice Implementation of credit info consistency check Database testing of manual credit info Insertion and updating credit info is made trustworthy Claim Arguments Evidence
FMEA example
Order system
FMEA of credit rating
Class / Method | Failure mode | Failure effect | Action or barrier | Severity
Credit rating of customer | Credit rating is too high | Customers can order more than their credit covers | 1) Manual check when setting or changing the credit rating 2) Implement a function that receives credit ratings from external sources | High
 | Credit rating is too low | The customer is not allowed to use their maximum credit | | Medium
FTA example
Train control - top level
Train control – message content
Exercise FMEA
Exercise FMEA - 1
[Figure: train control system. Components: Train, Balise, Transmitter / receiver (shown twice), Input, Position converter, Balise table.]
Exercise FMEA - 2
Make an FMEA for the position converter. This unit has the following functionality:
- Receive the balise number from the train radio via the optical fibre network
- Fetch the geographical position from the balise table
- Show the position to the train dispatcher