Connection reset by peer #40055

Closed
jsarrelli opened this issue May 6, 2024 · 5 comments
Labels

  • ARM
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Mgmt: This issue is related to a management-plane library.
  • needs-team-attention: This issue needs attention from Azure service team or SDK team
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that


jsarrelli commented May 6, 2024

Hi Team,
I'm getting "Connection reset by peer" errors from the SDK.
I don't know how to handle these kinds of errors properly. Why are they not retried by the SDK internally?
I've managed to add a retry policy when calling the "listAsync" methods directly, but in other scenarios, like the ones described below, the exception occurs while the SDK refreshes the objects internally.

Stack Trace

Scenario 1 - Azure VM:

reactor.core.Exceptions$ReactiveException: io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
	at reactor.core.Exceptions.propagate(Exceptions.java:396)
	at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:98)
	at reactor.core.publisher.Mono.block(Mono.java:1742)
	at com.azure.resourcemanager.compute.implementation.VirtualMachineImpl.refreshInstanceView(VirtualMachineImpl.java:449)
	at com.azure.resourcemanager.compute.implementation.VirtualMachineImpl.instanceView(VirtualMachineImpl.java:1910)
	at com.azure.resourcemanager.compute.implementation.VirtualMachineImpl.powerState(VirtualMachineImpl.java:1928)
	at rocks.coal.snapshotters.azure.VmListSnapshotter.$anonfun$generateSnapshots$9(VmListSnapshotter.scala:132)
	at akka.stream.impl.fusing.Map$$anon$1.onPush(Ops.scala:58)
	at akka.stream.impl.fusing.GraphInterpreter.processPush(GraphInterpreter.scala:557)
	at akka.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:511)
	at akka.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:403)
	at akka.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:650)
	at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:521)
	at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:625)
	at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:800)
	at akka.stream.impl.fusing.ActorGraphInterpreter$$anonfun$receive$1.applyOrElse(ActorGraphInterpreter.scala:818)
	at akka.actor.Actor.aroundReceive(Actor.scala:537)
	at akka.actor.Actor.aroundReceive$(Actor.scala:535)
	at akka.stream.impl.fusing.ActorGraphInterpreter.aroundReceive(ActorGraphInterpreter.scala:716)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:579)
	at akka.actor.ActorCell.invoke(ActorCell.scala:547)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
	at akka.dispatch.Mailbox.run(Mailbox.scala:231)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
	Suppressed: java.lang.Exception: #block terminated with an error
		at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:100)
		... 27 common frames omitted
Caused by: io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer

Scenario 2 - HealthCheck:

reactor.core.Exceptions$ReactiveException: io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer
	at reactor.core.Exceptions.propagate(Exceptions.java:396)
	at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:98)
	at reactor.core.publisher.Mono.block(Mono.java:1742)
	at com.azure.resourcemanager.recoveryservicessiterecovery.implementation.ReplicationVaultHealthsClientImpl.getWithResponse(ReplicationVaultHealthsClientImpl.java:192)
	at com.azure.resourcemanager.recoveryservicessiterecovery.implementation.ReplicationVaultHealthsClientImpl.get(ReplicationVaultHealthsClientImpl.java:209)
	at com.azure.resourcemanager.recoveryservicessiterecovery.implementation.ReplicationVaultHealthsImpl.get(ReplicationVaultHealthsImpl.java:42)
	at rocks.potash.azure.RecoveryServicesClient.replicationVaultHealthErrors(RecoveryServicesClient.scala:59)
	at rocks.coal.snapshotters.azure.RecoveryServicesSnapshotter.$anonfun$generateSnapshots$2(RecoveryServicesSnapshotter.scala:62)
	at akka.stream.impl.fusing.MapAsyncUnordered$$anon$30.onPush(Ops.scala:1427)
	at akka.stream.impl.fusing.GraphInterpreter.processPush(GraphInterpreter.scala:557)
	at akka.stream.impl.fusing.GraphInterpreter.processEvent(GraphInterpreter.scala:511)
	at akka.stream.impl.fusing.GraphInterpreter.execute(GraphInterpreter.scala:403)
	at akka.stream.impl.fusing.GraphInterpreterShell.runBatch(ActorGraphInterpreter.scala:650)
	at akka.stream.impl.fusing.GraphInterpreterShell$AsyncInput.execute(ActorGraphInterpreter.scala:521)
	at akka.stream.impl.fusing.GraphInterpreterShell.processEvent(ActorGraphInterpreter.scala:625)
	at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$processEvent(ActorGraphInterpreter.scala:800)
	at akka.stream.impl.fusing.ActorGraphInterpreter.akka$stream$impl$fusing$ActorGraphInterpreter$$shortCircuitBatch(ActorGraphInterpreter.scala:787)
	at akka.stream.impl.fusing.ActorGraphInterpreter.preStart(ActorGraphInterpreter.scala:778)
	at akka.actor.Actor.aroundPreStart(Actor.scala:548)
	at akka.actor.Actor.aroundPreStart$(Actor.scala:548)
	at akka.stream.impl.fusing.ActorGraphInterpreter.aroundPreStart(ActorGraphInterpreter.scala:716)
	at akka.actor.ActorCell.create(ActorCell.scala:643)
	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:513)
	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:535)
	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:295)
	at akka.dispatch.Mailbox.run(Mailbox.scala:230)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
	at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:373)
	at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1182)
	at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1655)
	at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1622)
	at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:165)
	Suppressed: java.lang.Exception: #block terminated with an error
		at reactor.core.publisher.BlockingSingleSubscriber.blockingGet(BlockingSingleSubscriber.java:100)
		... 30 common frames omitted
Caused by: io.netty.channel.unix.Errors$NativeIoException: recvAddress(..) failed: Connection reset by peer

Code Snippet

import akka.stream.scaladsl.Source
import com.azure.resourcemanager.compute.models.VirtualMachines

case class VirtualMachineClient(underlying: VirtualMachines) {

  def listInstances() = Source.fromPublisher(underlying.listAsync())
}



object AzureVms extends BufaTest {

  azureClient.virtualMachines.listInstances().runForeach(vm => logger.info(s"VM:${vm.name()},${vm.powerState()}"))

}
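
The `vm.powerState()` call in the snippet above is where Scenario 1 fails: per the stack trace, it blocks on an internal instance-view refresh, so a retry wrapped around `listAsync` does not cover it. Below is a minimal sketch of one way to retry such a blocking call at the call site, assuming that any `IOException` in the cause chain is transient. The `BlockingRetry` object and its methods are hypothetical helpers, not part of the SDK.

```scala
import java.io.IOException
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of the Azure SDK.
object BlockingRetry {

  /** True if `t` or any of its causes is an IOException (assumed to mean "transient"). */
  @tailrec
  def hasIoCause(t: Throwable): Boolean =
    if (t == null) false
    else if (t.isInstanceOf[IOException]) true
    else hasIoCause(t.getCause)

  /** Evaluates `call` up to `maxAttempts` times, doubling the sleep between attempts. */
  def retryOnIo[T](maxAttempts: Int, delayMillis: Long)(call: => T): T =
    Try(call) match {
      case Success(value) => value
      case Failure(e) if maxAttempts > 1 && hasIoCause(e) =>
        Thread.sleep(delayMillis) // blocks the calling thread during backoff
        retryOnIo(maxAttempts - 1, delayMillis * 2)(call)
      case Failure(e) => throw e // not an I/O failure, or attempts exhausted
    }
}
```

With this helper, the logging line could call `BlockingRetry.retryOnIo(3, 200)(vm.powerState())`. Because `Thread.sleep` blocks, running it inside an Akka stream stage ties up a dispatcher thread for the duration of the backoff.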

libraryDependencies ++= Seq(
  "com.azure" % "azure-core" % "1.48.0",
  "com.azure" % "azure-identity" % "1.12.0",
  "com.azure" % "azure-core-http-netty" % "1.15.0",
  "com.azure.resourcemanager" % "azure-resourcemanager" % "2.38.0",
  "com.azure.resourcemanager" % "azure-resourcemanager-appservice" % "2.38.0",
  "com.azure.resourcemanager" % "azure-resourcemanager-search" % "2.38.0",
  "com.azure.resourcemanager" % "azure-resourcemanager-recoveryservices" % "1.2.0",
  "com.azure.resourcemanager" % "azure-resourcemanager-recoveryservicesbackup" % "1.3.0",
  "com.azure.resourcemanager" % "azure-resourcemanager-resourcehealth" % "1.0.0",
  "com.azure.resourcemanager" % "azure-resourcemanager-recoveryservicessiterecovery" % "1.1.0"
)

final class RecoveryServicesClient(
                                    recoveryServicesManager: RecoveryServicesManager,
                                    recoveryServicesBackupManager: RecoveryServicesBackupManager,
                                    siteRecoveryManager: SiteRecoveryManager,
                                  )(implicit val context: ExecutionContext, actorSystem: ActorSystem) {

  def replicationVaultHealthErrors(resourceGroupName: String, resourceName: String): Future[List[HealthError]] =
    sourceWithRetry {
      Source(
        siteRecoveryManager
          .replicationVaultHealths()
          .get(resourceName, resourceGroupName)
          .properties()
          .vaultErrors()
          .asScala
          .toList
      )
    }.runFold(List.empty[HealthError])(_ :+ _)

  private def sourceWithRetry[T](source: Source[T, _]) = RestartSource
    .onFailuresWithBackoff(RestartSettings(1.second, 10.seconds, 0.1).withMaxRestarts(5, 1 minute))(() => source)

}
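
One thing that may be worth checking in the snippet above: `sourceWithRetry` takes its `source` argument by value, so the blocking `get(...)` call runs once while the argument is being built, before `RestartSource` is involved. The Scenario 2 stack trace appears consistent with this, since the exception surfaces directly from `replicationVaultHealthErrors`. Below is a minimal sketch of a variant that defers the call until the stream runs, so each restart repeats it. `LazyRetry` and `retryingCall` are hypothetical names, and the backoff values are copied from the snippet above.

```scala
import akka.NotUsed
import akka.stream.RestartSettings
import akka.stream.scaladsl.{RestartSource, Source}
import scala.concurrent.duration._

// Hypothetical helper: defers a blocking call so that RestartSource can repeat it.
object LazyRetry {

  // Same backoff values as the original sourceWithRetry.
  private val settings =
    RestartSettings(1.second, 10.seconds, 0.1).withMaxRestarts(5, 1.minute)

  /** `call` is invoked inside the stream, once per (re)start, not when this method is called. */
  def retryingCall[T](call: () => T): Source[T, NotUsed] =
    RestartSource.onFailuresWithBackoff[T](settings)(() => Source.lazySingle(call))
}
```

Under these assumptions, `replicationVaultHealthErrors` could pass the whole `siteRecoveryManager ... .asScala.toList` expression as `() => ...` to `LazyRetry.retryingCall` and run the result with `Sink.head` to obtain the `Future[List[HealthError]]`. The call still blocks the thread it runs on, so a dedicated dispatcher may be appropriate.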

Setup (please complete the following information):

  • OS: Linux
  • IDE: IntelliJ
  • Library/Libraries:
  • Java version: 17

Information Checklist
Kindly make sure that you have added all of the following information above and checked off the required fields; otherwise we will treat the issue as an incomplete report.

  • [-] Bug Description Added
  • [-] Repro Steps Added
  • [-] Setup information Added

github-actions bot added the ARM, customer-reported, Mgmt, needs-team-attention, and question labels on May 6, 2024

github-actions bot commented May 6, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

weidongxu-microsoft (Member) commented May 7, 2024

I think the SDK won't automatically retry on TCP errors.

Is there any READ/WRITE timeout, or proxy in the middle that could close the TCP connection?

What is the typical time from the start of a connection, to the interruption of the connection by "Connection reset by Peer"?

jsarrelli (Author) commented

There is no timeout or proxy.
We open new connections every couple of minutes, and those connections last one minute at most. Most of them succeed, but others fail with these errors.

When a connection fails, it does so within the first 5 seconds or so.

weidongxu-microsoft (Member) commented May 8, 2024

jsarrelli (Author) commented

You were right!

I have a custom retry policy applied which was misconfigured!
Thank you for your help!
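
For readers arriving with the same symptom, it may help to see where a retry policy is attached to the management client. The thread does not show the custom policy that was misconfigured, so the following is only a sketch of one explicit configuration. The `ArmClientWithRetry` object and the retry values (5 retries, 1 s base delay, 10 s maximum delay) are illustrative assumptions, as is authentication through `DefaultAzureCredential`.

```scala
import java.time.Duration

import com.azure.core.http.policy.{ExponentialBackoff, RetryPolicy}
import com.azure.core.management.AzureEnvironment
import com.azure.core.management.profile.AzureProfile
import com.azure.identity.DefaultAzureCredentialBuilder
import com.azure.resourcemanager.AzureResourceManager

// Hypothetical wiring; the retry values are illustrative, not recommendations.
object ArmClientWithRetry {

  def build(): AzureResourceManager = {
    // Exponential backoff: up to 5 retries, starting at 1 s and capped at 10 s.
    val retryPolicy =
      new RetryPolicy(new ExponentialBackoff(5, Duration.ofSeconds(1), Duration.ofSeconds(10)))

    AzureResourceManager
      .configure()
      .withRetryPolicy(retryPolicy) // replaces the pipeline's default retry policy
      .authenticate(new DefaultAzureCredentialBuilder().build(), new AzureProfile(AzureEnvironment.AZURE))
      .withDefaultSubscription()
  }
}
```

Because a policy supplied this way replaces the default one, a custom policy with too few retries or too short a delay changes how transient failures are handled for every call made through the client.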
