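- The lack of a clear and consistent construct for how reliability and safety interact and build upon each other is creating avoidable conflict and potential miscommunication, which can expose autonomous-vehicle customers to unnecessary risk and add excessive system cost.
- Introducing redundancy across vehicle sensing, control, power, and braking adds cost without necessarily making the occupants, or the traffic around them, any safer.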
An overemphasis on safety without a robust and equivalent reliability process and organization will result in errors that could be catastrophic.
On March 18, 2018, the first pedestrian fatality due to the operation of an autonomous vehicle occurred in Tempe, Arizona. Since then, almost 10,000 articles have been published on this accident, most of them offering an opinion on what it all means for the future of Uber, autonomous vehicles, public-road AV testing, and even society at large.
What is missing from this cauldron of debate are the lessons that designers of autonomous sensor, software, and platform technologies can extract from this tragic event. Learning from it will be pivotal to the financial success of autonomous vehicles.
A fundamental challenge in learning from the Tempe fatality and in determining the value of ISO 26262 (the functional safety standard for road vehicles) is in identifying the complementary and contradictory roles of reliability and safety. This is not a matter of semantics: every manager realizes that process, authority, and responsibility are the core of every software and hardware design cycle. Who does what, who reports to whom, and when they do it can result in dramatically different outcomes.
What is reliability, what is safety, and how should they relate to each other in a corporate environment? From the perspective of reliability engineers, safety is a subset of reliability. Why? While reliability focuses on the probability that a failure will occur, safety focuses on the probability that a failure will occur and result in a catastrophic event (loss, injury, or death).
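The subset argument can be written down compactly. The symbols here are assumptions made for illustration, not the author's notation: let F be the event that a failure occurs and C the event that the outcome is catastrophic.

```latex
P(F \cap C) = P(F)\, P(C \mid F) \le P(F)
```

In this reading, reliability tracks P(F), while safety tracks P(F ∩ C), which can never exceed P(F). That is the sense in which reliability engineers regard safety as a subset of their work.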
Catastrophic events are just a small portion of the overall outlook being managed and tracked by the reliability team. Thus, in a reliability-centric world, safety engineers are managed by the reliability team and do not act until a thorough design-for-reliability (DfR) activity is complete.
Reliability and Safety interact
As one would expect, safety engineers do not share the same vision. From their viewpoint, reliability analyses only provide the probability of failure for a particular failure mechanism (reliability physics) or part (empirical approach). Reliability analyses have no context as to the consequence of failure: will it be catastrophic? Such analyses are therefore most effective when performed at the lowest level of the system. Because consequences are only clear at the system level, where the response of the system or the user to the failure can be considered, reliability engineers should report into the safety team.
The key function of reliability engineers is to calculate failure rates and identify basic failure modes. And since, sometimes, these failure rates are only numbers, why have a reliability engineer at all?
A third viewpoint is that reliability and safety are not as related as one would expect. A prime example of this philosophy is how the two disciplines would address fan performance. From a reliability perspective, the actions might be to ensure the fan meets failure rate goals for the expected environment, either through reliability physics analysis (RPA), derating, or accelerated life testing (ALT). From a safety perspective, the actions might be to determine if fan failure would induce a catastrophic event (how it interacts with the rest of the system) and then introduce potential mitigations, such as redundancy or prognostics using drift or change in key parameters (current draw, tachometer, noise).
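The safety-side idea of prognostics based on drift in a key parameter can be sketched in a few lines of Python. This is a minimal illustration only: the function names, the sample current-draw readings, and the 15% threshold are all assumptions made for the example, not values from the article or from any standard.

```python
from statistics import mean


def drift_fraction(baseline: list[float], recent: list[float]) -> float:
    """Relative change of the recent mean versus the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean must be non-zero")
    return (mean(recent) - base) / base


def fan_needs_attention(
    baseline: list[float],
    recent: list[float],
    threshold: float = 0.15,  # hypothetical limit, chosen for illustration
) -> bool:
    """Flag the fan when a monitored parameter drifts beyond the threshold."""
    return abs(drift_fraction(baseline, recent)) > threshold


# Hypothetical fan current-draw readings in amperes.
healthy_current = [0.50, 0.51, 0.49]
worn_current = [0.60, 0.61, 0.59]
print(fan_needs_attention(healthy_current, worn_current))
```

The same check could be applied to tachometer or noise readings. A real prognostic scheme would also need to handle sensor noise and operating-point changes, which this sketch ignores.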
These different viewpoints highlight the uncertainty among technology companies on how to handle reliability and safety. One major consumer technology company that is transitioning to autonomous vehicles has Reliability and Safety reporting into the same Director. A second company, a leader in the autonomous field, has Safety and Reliability reporting into two different organizations, even though the leaders in both departments have roughly equivalent titles. A third company, a mainstay in automotive electronics that is aggressively targeting autonomous control units, also has Safety and Reliability in two different organizations, but clearly has a favorite through the numerous executive titles assigned to Safety (while the highest reliability staffer is either Manager or Leader).
Without a clear and consistent construct for how reliability and safety interact and build upon each other, the automotive industry is creating avoidable conflict and potential miscommunication that will either put customers under unnecessary risk, create autonomous systems that are excessively expensive, or both. One autonomous vehicle manufacturer had such uncertain confidence in reliability, or such unlimited authority of the safety team, that it introduced redundancy throughout the vehicle (including sensing, control, power, braking, etc.). Given that the average car has, by some estimates, over $12,000 of electronics, this introduces significant costs without necessarily making the occupants, or the traffic around them, that much safer.
A perfect example of this issue is the divergence between safety and reliability in how to calculate failure rates. From the 1950s through the 1990s, most reliability practitioners in electronic hardware organizations used empirical handbooks to calculate failure rates. These handbooks were simply aggregations of field failure data, sorted by part technology (resistor, capacitor, diode, etc.). While simple in concept and execution, repeated studies demonstrated that these handbooks were wildly inaccurate when used on actual products, with the error leaning towards the conservative, that is, over-predicting failure rate.
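To show why the handbook approach is simple in concept, here is a minimal Python sketch of a parts-count style sum: each part type gets one tabulated failure rate, and the system rate is the quantity-weighted total. The FIT values below are placeholders invented for the example and are not taken from any handbook; the function names are likewise hypothetical. (One FIT is one failure per 10^9 device-hours.)

```python
# Placeholder per-part failure rates in FIT; NOT values from a real handbook.
HYPOTHETICAL_FIT = {"resistor": 1.0, "capacitor": 2.0, "diode": 5.0}


def parts_count_fit(
    bill_of_materials: dict[str, int],
    fit_table: dict[str, float] = HYPOTHETICAL_FIT,
) -> float:
    """Sum quantity x tabulated failure rate over every part type."""
    return sum(qty * fit_table[part] for part, qty in bill_of_materials.items())


def mtbf_hours(total_fit: float) -> float:
    """Convert a constant failure rate in FIT to mean time between failures."""
    return 1e9 / total_fit


bom = {"resistor": 100, "capacitor": 50, "diode": 10}
total = parts_count_fit(bom)
print(total, mtbf_hours(total))  # 250.0 FIT, 4000000.0 hours
```

Nothing in the calculation refers to temperature cycling, vibration, solder fatigue, or any other physical mechanism, which is the limitation the article goes on to describe.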
The reason was straightforward: these handbooks were not based on the actual mechanisms that cause failure. Fast forward to the 21st century and most skilled reliability practitioners no longer rely exclusively on empirical field data to predict failure rates. Reliability physics analysis (RPA) and accelerated life testing (ALT) replaced these outmoded approaches and nowhere was this truer than in the automotive industry. Until ISO 26262 came along.
Avoiding the disconnect
As a functional safety standard, ISO 26262 requires the computation of failure rates and the appropriate mitigations to predict the automotive safety integrity level (ASIL). And the safety community, unlike reliability engineers, strongly encourages or even requires empirical prediction handbooks to be the basis of ASIL calculations. This disconnect is driven by the lack of a universal construct between reliability and safety. Creating separate organizations reporting into separate management has led to a breakdown in communication, causing safety engineers to use outmoded approaches for failure rate calculations.
In addition, without a balance between the two groups, safety teams will tend to prefer higher failure rates, which in turn require additional safety analyses and safety mitigations, including redundancy. Safety's focus on simple handbook calculations will also result in critical failure modes being overlooked, such that safety mitigations are no longer effective.
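The link between an over-predicted failure rate and added redundancy can be illustrated with a rough Python sketch. It assumes a constant failure rate (exponential model) and fully independent redundant channels; both are simplifying assumptions, and the function names, mission time, and target are hypothetical values chosen for the example.

```python
import math


def failure_probability(fit: float, hours: float) -> float:
    """Probability that one channel fails within `hours` at a constant rate."""
    return 1.0 - math.exp(-fit * 1e-9 * hours)


def redundant_failure_probability(fit: float, hours: float, channels: int) -> float:
    """All channels must fail; assumes the channels fail independently."""
    return failure_probability(fit, hours) ** channels


def channels_needed(fit: float, hours: float, target: float, max_channels: int = 4) -> int:
    """Smallest channel count whose combined failure probability meets the target."""
    for n in range(1, max_channels + 1):
        if redundant_failure_probability(fit, hours, n) <= target:
            return n
    raise ValueError("target not reachable within max_channels")


# Same hypothetical part, two failure-rate estimates, 10,000-hour mission.
print(channels_needed(fit=100.0, hours=10_000, target=1e-3))  # physics-based estimate
print(channels_needed(fit=500.0, hours=10_000, target=1e-3))  # over-predicted estimate
```

With these illustrative numbers, the lower estimate meets the target with a single channel, while the over-predicted estimate calls for a second one. The independence assumption is also where an overlooked failure mode would bite: a mechanism that affects every channel at once is not helped by duplicating the channel.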
There is still an opportunity for improvement. Players in autonomous technology, from semiconductors to electronic modules to overall systems, must realize that an overemphasis on safety without a robust and equivalent reliability process and organization will result in errors that will be difficult to untangle.
A good first step is to make sure that reliability and safety are within the same organization, reporting to a neutral observer. Both sides should agree to implement best practices, including use of state-of-the-art simulation and modeling and reliability physics, to lay the groundwork for appropriate and effective risk identification and mitigation.
Author: Craig Hillman
Source: SAE Automotive Vehicle Engineering Magazine
- Industry: Automotive
- Topic: Safety